
Detection of Taan Sections in Khayal Vocal Concerts

    Submitted in partial fulfillment of the requirements

    of the degree of

    Master of Technology

    by

    Amruta J. Vidwans

    (Roll no. 123076005)

    Supervisor:

    Prof. Preeti S. Rao

    Department of Electrical Engineering

    Indian Institute of Technology Bombay

    2015

Amruta J. Vidwans / Prof. Preeti S. Rao (Supervisor): "Detection of Taan Sections in Khayal Vocal Concerts", MTech Degree Dissertation, Department of Electrical Engineering, Indian Institute of Technology Bombay, July 2015.

    Abstract

Structural segmentation of concert audio recordings is very useful in music navigation and automatic summarization. It is particularly relevant for Indian classical music, where concerts can extend for hours and commercial audio recordings rarely provide timing details of the various sections, although the performance typically follows an established structure depending on the genre. The distinct concert sections have contrasting rhythmic, and sometimes melodic, structures. The proposed work is concerned with the automatic segmentation of specific musical sections from the audio of Khayal vocal music concerts. The taan section has a distinct melodic character across concerts, irrespective of the tempo. Our goal is to label the taan sections using acoustic features that capture this melodic style. The features are derived from low-level audio analysis, including pitch and energy tracking of the singing voice.

The proposed system performs binary classification of frames into taan and non-taan classes. The posterior probability vectors obtained in the course of the statistical classification are used in a grouping stage. Smoothing at the higher time scale appropriate to concert-section detection is achieved using change detection methods. The grouping stage uses heuristics derived from a study of musicians' annotations of taan episodes on a concert data subset. We evaluate the system in two stages: by its frame-level classification accuracy, and by reporting the numbers of detected (true, over-segmented and under-segmented), false positive and false negative taan episodes after the grouping stage. We compare the results of our proposed method with an unsupervised segmentation method, showing that the proposed method achieves superior results over a database of 96 concerts in terms of giving fewer false positives.

Index terms: audio segmentation, taan detection, multilayer perceptron (MLP), posterior probability, self-distance matrix (SDM), novelty score, Hindustani classical music, Khayal vocal concerts

Contents

Dissertation Approval
Declaration of Authorship
Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Survey
3 Segmentation System Overview
  3.1 Feature Extraction Block
    3.1.1 Pre-processing
    3.1.2 Feature Extraction
  3.2 Change Detection and Labeling Block
4 Database Description
  4.1 Database Subsets for Evaluation
    4.1.1 Database Subset of 32 Concerts
    4.1.2 Database Subset of 96 Concerts
    4.1.3 Database Subset of 22 Truncated Jasraj Concerts
    4.1.4 Musicians' Annotation: 24 Concert Subset
  4.2 Khayal Concert Details
    4.2.1 Khayal Concert Structure
    4.2.2 Taan Section
  4.3 Annotation Methodology
5 Pre-processing
  5.1 Singing Voice Detection (SVD)
  5.2 Obtaining Finer Ground Truth within Taan Section (SAD Method)
6 Taan Section Detection
  6.1 Pre-processing / Melody Extraction
  6.2 Feature Extraction
    6.2.1 Pitch Based Features
    6.2.2 Energy Fluctuation Rate
  6.3 Inspection of Variability of Features with Tempo
  6.4 Classification and Grouping using Posteriors
7 Experiments and Evaluation
  7.1 Evaluation on data Subset B
  7.2 Experiments on data Subset A
  7.3 Experiments on data Subset D
8 Conclusion and Future Work
Acknowledgements

List of Figures

3.1 Simplified block diagram for detecting sections in a Khayal vocal concert
3.2 Block diagram for the proposed system of detecting segments in a Khayal vocal concert
3.3 SDM and novelty for a sitar concert
4.1 Possible sequence of sections in a Khayal vocal concert
4.2 Rhythmogram of tabla onsets in a Khayal vocal concert
4.3 Various sections in a Khayal vocal concert
5.1 Algorithm-marked Vocal (V) and Instrumental (I) boundaries
5.2 Example spectrogram of errors in SVD
5.3 Example spectrogram of non-taan movement occurring in a taan episode
6.1 Spectrogram of an akar taan
6.2 Pitch and energy contour of a taan segment
6.3 Spectrogram of a sargam taan segment
6.4 Spectrogram of a bol taan segment
6.5 The Energy Fluctuation Rate feature for an audio plotted (a) without a texture window applied (b) with a texture window of 5 sec and hop of 1 sec
6.6 Taan frame feature values for Bada Khayal and Chota Khayal
6.7 Histogram of Euclidean distance between feature values to show tempo invariance of taan features
7.1 Various scenarios that occur after grouping
7.2 (a) ROC by thresholding the posterior values obtained from the MLP (b) SDM+novelty+grouping stages in one of the Subset B audios
7.3 Comparison of pitch contours obtained from PolyPDA and the Melodia plug-in
7.4 Comparison of taan episodes detected after SDM+novelty+grouping using posteriors from (a) MLP and (b) GMM
7.5 Errors in taan episode detection after SDM+novelty+grouping using posteriors from the MLP

List of Tables

4.1 Distribution of clips per artist and the gharana of the artists
4.2 Summary of database subsets
4.3 Khayal concert sections with musical characteristics and acoustic correlates
6.1 Comparison of frame-wise accuracies of SVM and MLP with precision and recall values for each class
7.1 Performance evaluation using conventional measures on (a) the proposed system and (b) the GMM based system of [1] after grouping at frame level
7.2 Taan detection performance after grouping for the 35-train / 22-test concert scenario (92% of taans detected)
7.3 Taan detection performance after grouping for (a) pitch extracted from the Melodia plug-in (33% of taans detected) (b) pitch extracted from PolyPDA (82% of taans detected)
7.4 Taan detection performance after grouping using (a) MLP on data Subset D (80% of taans detected) (b) GMM on data Subset D (86% of taans detected)
7.5 Frame level classification accuracy, precision and recall using MLP on data Subset D

Chapter 1

    Introduction

Audio segmentation systems aim at representing the audio at a broader level via labeled musical sections. In the case of Hindustani classical music there is much variability with respect to the duration of the concert, the number of repetitions of the segments, the freedom of improvisation, the school of music and the artist's individuality in rendering the concert. The most popular performance in a Hindustani classical vocal concert is of the Khayal genre. The problem of segmentation in Indian classical music has been approached in the past using rhythmic features for tabla concerts and a few Khayal vocal concerts [2], but that work lacks an evaluation of system performance or a structural analysis, as it gives only a visual representation. There are no papers that we know of addressing the problem of Khayal concert segmentation apart from [2]. Attempts were made by [3] for the case of Carnatic music to classify an audio into a section, rather than to segment within a concert, where the presence or absence of percussion was an important cue. A detailed review of audio segmentation systems for different music styles is given in Chapter 2.

The segmentation of a Khayal concert holistically is possible by using other dimensions of the audio, i.e. pitch, energy and timbre, along with rhythm. We study the musical and acoustic characteristics of the various Khayal sections and the possible features that can be used for them. For certain Khayal sections, the taan section in particular, the nature of the pitch variation is the only cue that can be used to distinguish it from other sections. We explore the possibility of taan detection and labeling in Hindustani classical Khayal vocal concerts by proposing features that distinguish taan from other sections. The main challenge is the detection of taan sections at a time scale meaningful to musicians. There should be a mechanism to ignore taan-like movements that may occur in other sections of a Khayal performance. The taan section might occur multiple times in the performance, and we need to retrieve all of its occurrences without inducing false positives. These are a few of the challenges of taan detection and labeling that we wish to address through this work. A general system for detecting sections of Khayal vocal concerts is proposed. Since the taan section can be distinguished using pitch alone, we need an automatic vocal pitch extraction algorithm. Pitch extraction algorithms are not yet fully automatic, with around 80% accuracy possible. Since pitch extraction of the lead vocal artist is a crucial step in designing features for detecting sections in Khayal, one of our goals will also be to look into possibilities of improving the accuracy of the current state-of-the-art pitch tracker [4] by improving one aspect of it, namely the Singing Voice Detection (SVD). The main focus is the study of a detection and labeling system for the sections of a Khayal vocal concert, with proposed features for the taan section, and the evaluation of this system. Such a system will be helpful in an audio browser like Dunya [5], which is dedicated to easy access and navigation of Hindustani classical concerts. It will be useful for fast navigation through Khayal concerts and for music summarization.

The existing segmentation systems are discussed in Chapter 2. There is no audio segmentation system that works on Khayal concerts, but systems developed on different databases that might be applicable to our case are mentioned. We give an overview of the proposed system in Chapter 3. We discuss the database that we have used for our study and describe the distinctive characteristics of the sections in Khayal concerts in Chapter 4. The database was selected such that all types of taan are present and almost all schools of thought in Hindustani classical music are represented. Pre-processing of the audios, required to obtain a reliable pitch track corresponding to the vocal melody, is discussed in Chapter 5, followed by a description of the features for the taan section of Khayal in Chapter 6. The experiments done at the intermediate and final stages of the system, and a discussion of the results obtained, are presented in Chapter 7, followed by the conclusion in Chapter 8.

Chapter 2

    Literature Survey

A broad overview of methods available for structural segmentation is given in [6]. The primary aim of the methods described in that overview paper is to identify the various homogeneous segments within an audio, and in some papers to label the segments as well. The papers reviewed in the article mainly use features that exploit one or a group of aspects related to melody, harmony, rhythm and timbre. Only a few papers have used all these aspects of audio together for segmentation, and the author states that this might be a more effective way of approaching the segmentation problem. The three criteria used to approach the segmentation problem across the papers are seen to be repetition, homogeneity, and novelty. The homogeneity- and novelty-based approaches give similar information, with homogeneity-based approaches describing the contents within a segment and novelty-based approaches describing the boundary between contrasting sections. No approach was seen to be particularly superior to the others, and they have been tried only on Western popular music, where the structure is relatively standard. The author stresses investigating the use of musically motivated features and different distance measures for frame-level feature comparison.

The paper by Turnbull et al. [7] is notable in that it uses all the aspects of the audio: timbre, melody, rhythm and harmony. They use a supervised approach where the features, along with their first and second order differences as well as their smoothed versions, are passed to a boosted decision stump classifier. They train the classifier to output frames as belonging to a boundary or non-boundary class, which is an unconventional way to approach the problem. Providing the information of change in this way was seen to work better in a supervised framework. Also, the final features after feature selection were seen to correspond to all the aspects, i.e. rhythm, timbre, melody and harmony, thus stressing the importance of using all the dimensions of music for effective segmentation.
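As an illustration only (not the exact system of [7]), the following minimal Python sketch shows the feature augmentation and a boosted decision stump classifier for this boundary/non-boundary framing; X_train, y_train and X_test are hypothetical frame-feature matrices and labels.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def augment(X, smooth_len=5):
        """Append first- and second-order differences and a smoothed
        version of the frame features, as described for [7]."""
        d1 = np.gradient(X, axis=0)
        d2 = np.gradient(d1, axis=0)
        kernel = np.ones(smooth_len) / smooth_len
        sm = np.apply_along_axis(
            lambda c: np.convolve(c, kernel, mode='same'), 0, X)
        return np.hstack([X, d1, d2, sm])

    # sklearn's AdaBoostClassifier boosts depth-1 trees (decision stumps)
    # by default:
    # clf = AdaBoostClassifier(n_estimators=200).fit(augment(X_train), y_train)
    # boundary_prob = clf.predict_proba(augment(X_test))[:, 1]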

The paper [8] applies HMM based methods to the music segmentation problem. The author mentions that the method is likely to succeed on recordings made with modern production techniques, particularly where copy and paste has been used to produce multiple segments of the same type. The noteworthy aspects are the use of an adaptive hop and window size (for feature calculation) depending on the beat interval, and the use of a histogram of states obtained from the Hidden Markov Model (HMM) to determine the labels. The segmentation is performed on Western music audios with sub-band energy in logarithmically spaced bands as the feature. A beat tracking algorithm is used to find the beat locations, which are used as the hop size, with the frame size also changing dynamically according to the inter-beat interval. This can be useful in Khayal vocal concert segmentation, as the tempo keeps increasing gradually over the concert; we might therefore want to use similar functionality for feature calculation. An HMM with a fairly large number of states is trained on the features. During training, the output of each state is assumed to have a single Gaussian distribution. After obtaining the state probabilities, the Viterbi algorithm is used to maximize the probability of the observed data, and the best path output is used to segment the sections. Thus every audio waveform is transformed into a time series with a specific value at every beat position. The authors clearly depict the need for a correct choice of the number of HMM states in bringing out specific patterns in the audio. A single HMM state will not correspond to a particular segment in the audio structure; rather, a collection of states contributes to it. Thus at each of the beat positions the histogram of the HMM states is computed. Manual segmentation is used to derive reference template histogram distributions, and then for each beat location the nearest reference template is assigned as the label for that beat.
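As a rough sketch of this pipeline, under the assumption that a beat-synchronous feature matrix X is already available, and using the hmmlearn package rather than whatever implementation [8] used:

    import numpy as np
    from hmmlearn import hmm

    def state_histograms(X, n_states=40, win=16):
        """Fit an HMM whose states have single (diagonal) Gaussian
        outputs, Viterbi-decode the state sequence, and histogram the
        states over a sliding window of beats."""
        model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag')
        model.fit(X)
        states = model.predict(X)  # best (Viterbi) path
        hists = []
        for i in range(len(states) - win):
            h = np.bincount(states[i:i + win], minlength=n_states)
            hists.append(h / win)
        return np.array(hists)

    # Each histogram row would then be matched to the nearest reference
    # template derived from manual segmentation, giving a label per beat.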

Another paper that uses various dimensions of the audio, such as timbre, rhythm, harmony and energy via MFCC, rhythmogram, chroma and short-time energy representations, is [9]. This paper emphasizes the performance improvement obtained in segmentation when multi-track audio data is available. The main idea is that each of the features may get corrupted in polyphonic recordings but is generally unaffected in separate recordings. They also start with beat tracking algorithms and then assign a feature vector to each beat position. Multiple features are extracted for each of the tracks: 13 MFCCs, a 12-dimensional chroma vector, 200 rhythmogram-based features and 1 RMS energy feature. For each track we thus get 226 features, and for each sub-category of features an SDM is computed separately. Each of these SDMs is convolved with an appropriate kernel to obtain a novelty score. In order to get the best performance, each of the SDMs is given a weight from {1, 10, 100}; a final SDM is obtained by adding the individual SDMs, and its performance is computed for different combinations of weights. For a particular genre, the weights which give the best performance are chosen as the final model.
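A schematic of this weighted fusion is sketched below; the feature matrices and the scoring function segmentation_score are placeholders for the paper's actual components, and the per-feature SDMs are assumed to be comparably scaled.

    import itertools
    import numpy as np
    from scipy.spatial.distance import cdist

    def sdm(F):
        """Self-distance matrix over the frame vectors of one feature family."""
        return cdist(F, F, metric='euclidean')

    def fused_sdm(feature_sets, weights):
        return sum(w * sdm(F) for w, F in zip(weights, feature_sets))

    # feature_sets = [mfcc, chroma, rhythm_feats, rms]  # hypothetical inputs
    # best_weights = max(
    #     itertools.product([1, 10, 100], repeat=len(feature_sets)),
    #     key=lambda w: segmentation_score(fused_sdm(feature_sets, w)))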

In the case of Hindustani classical music, the segmentation problem was approached using rhythm as a feature by [2] for Khayal vocal concerts as well as tabla concerts. In Hindustani music the changes in rhythm can be perceived when the tabla performer improvises away from the basic rhythmic cycle or when the rhythmic cycle itself changes. Tabla onsets are detected by processing the band of prominence of tabla strokes with an onset detection function, thus giving a representation of rhythm. The visual representation of the feature using a self-distance matrix reveals the rhythmic structure of a tabla solo concert, which has tabla as the lead instrument, and gives the boundaries of the major segments of the Khayal vocal concert. Spectral features combined with an auditory-processing-motivated bi-phasic function achieved good time localization of onsets for polyphonic audio. The changes in the self-distance matrix are detected by correlating a checkerboard kernel, of width relevant to the desired temporal resolution, across its diagonal to get a novelty score depicting the changes. Here, however, an evaluation of the system was not presented, and it was mentioned in the paper that the subsections within Khayal vocal concerts cannot be segmented using rhythm alone.

Verma et al. [1] have worked with Hindustani instrumental concerts and Dhrupad vocal performances and have combined various features dealing with different musical aspects of the audio, such as chroma, tempo and energy based features. They have proposed converting features into their posterior probabilities to make them robust to local variations. A limitation of their work is that they use a dataset which has little to no variability in the number of sections, and skipping or repetition of sections is not seen in their concerts.

Ranjani et al. [3] carried out classification of Carnatic audios into their musical forms. They do not perform segmentation but label an audio which represents a segment. Hierarchical classification was used to assign the label, using the absence or presence of rhythm as a major cue. This can be used in the Khayal vocal concert segmentation task after obtaining the boundaries, to arrive at the final label names using the feature value distributions.

Since our task involves the detection and segmentation of a specific named section of the concert, we need to invoke both segmentation and supervised classification methods. Musically motivated features and methods are our chosen approach, given their potential for success with limited training data [10]. The challenges to taan detection are the polyphonic setting, where we want to focus on the vocal signal, and designing distinctive features that are artist and concert independent. Given that pitch modulations are the prime characteristic of taan, reliable pitch detection with sufficient time resolution is necessary. Finally, we need to convert the low-level analyses to an annotation that closely matches the musicians' labeling of taan episodes from a performer's point of view. Towards these goals, we use a vocal source separation algorithm based on predominant-F0 detection [4]. Features designed to capture the characteristic rapid but regular pitch and energy variations of the voice are presented. A frame-level classification at 1 sec granularity is followed by a grouping stage, with the goal of emulating the subjective labeling of taan by musicians as extended regions that occur at salient positions in the concert.

Chapter 3

    Segmentation System Overview

Figure 3.1: Simplified block diagram for detecting sections in a Khayal vocal concert

We want to segment out the different sections in a Khayal vocal concert using their acoustic characteristics as discussed before. The final aim is to get correct section boundaries along with their labels. For that we propose the modules seen in the block diagram of Figure 3.1, consisting of two main blocks, viz. the feature extraction block and the change detection and labeling block. The feature extraction block deals with the observation of acoustic characteristics and with using existing features, or coming up with new ones, that will bring out the distinction between the various Khayal vocal concert segments. The pre-processing block processes the audio and presents it in a simplified form for feature extraction. For example, if features are to be extracted on the vocal pitch alone, then pre-processing will perform vocal pitch extraction; if rhythm related features are needed, then tabla strokes enhanced in a sub-band will be provided. The change detection and labeling block deals with the means to arrive at boundaries between sections, using the existing framework of the self-distance matrix (SDM) and the subsequent modules that seem suitable for the problem. The challenging aspect is to come up with section names and to detect their recurrence. The detailed block diagram is given in Figure 3.2.

Figure 3.2: Block diagram for the proposed system of detecting segments in a Khayal vocal concert

    3.1 Feature Extraction Block

This block includes a pre-processing step followed by the feature extraction step. Each is explained in the following sections.

    3.1.1 Pre-processing

Since the vocal pitch is the primary cue for various sections, we need to extract it reliably in order to compute features on it. Tabla onsets might also be a cue for broad-level sectioning into the alap, Bada Khayal and Chota Khayal sections, so we need voice and instrument sound isolation to extract features separately from each. The pitch extraction of the dominant source is first carried out using the technique of [4]. The assumption is that the vocal melody, being in the lead, will have the strongest contribution to the spectrum. Thus by extracting the predominant F0, we expect to get the melody corresponding to the voice wherever both the voice and the Secondary Melodic Instrument (SMI) are present. This is further used in the dominant source spectrum isolation block to isolate the dominant source spectrum, by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained at each analysis frame by searching in the vicinity of multiples of the detected pitch location.

Features are extracted on the source-isolated spectral envelopes, as they are less dependent on the pitch of the source and represent the source timbre better. The feature extraction block uses static and dynamic timbral features as well as dynamic F0-harmonic features, as in [11]. Using only selected features for the genre under consideration, i.e. Hindustani classical, and applying an unsupervised classification algorithm such as k-means clustering, labels are obtained for the frames as belonging to either the voice or the instrumental regions. An unsupervised algorithm is used because the nature of the acoustic characteristics is clearly different for voice and instrument regions within an audio. From here we can extract the pitch corresponding to the vocal regions alone and give it to the feature extraction block.

The accuracy of Singing Voice Detection (SVD) using the unsupervised k-means clustering algorithm and genre specific features was seen to be 83% in a previous study. Further experimentation and evaluation on the database considered in this thesis is explained in section 5.1. The pitch extraction algorithm of [4] is superior to others for the case of Indian classical music, with minor occasional octave errors. Any other errors observed during the course of the work also need to be taken care of, to improve the pitch extraction so that reliable feature extraction is possible for the detection of sections.
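A minimal sketch of the clustering step, assuming the frame-level features of [11] are already computed; the variance-based rule for identifying the vocal cluster is a hypothetical stand-in for the actual cluster-identification logic:

    import numpy as np
    from sklearn.cluster import KMeans

    def svd_kmeans(frame_features):
        """Cluster frames into two groups and return a boolean vocal mask."""
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(frame_features)
        # Hypothetical heuristic: the vocal cluster tends to show higher
        # feature variance, the voice being the more dynamic source.
        var0 = frame_features[labels == 0].var()
        var1 = frame_features[labels == 1].var()
        vocal_label = 0 if var0 > var1 else 1
        return labels == vocal_label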

    3.1.2 Feature Extraction

We can derive features for the purpose of detecting sections in a Khayal vocal concert using the extracted vocal pitch, or the source-isolated spectrum belonging to the voice alone, whichever is suitable given the acoustic characteristics of the Khayal sections in question. Other features based on the timbre or the rhythmic onsets of the tabla can be derived directly from the audio, using pre-processing as required. The acoustic characteristics and possible features for each section are discussed in section 4.2. Since we choose the features to be representative of a musical section, we expect them to be homogeneous within sections while contrasting across sections. This contrast among feature values across sections can be captured by giving the features to the change detection and labeling block.

    3.2 Change Detection and Labeling Block

This block is generic and can be applied to any set of features irrespective of the style of music, as the purpose of the block is to detect change between contrasting sections; for the purpose of labeling, however, some changes might be required. The first component in the change detection and labeling block is classification using the features. The posterior probability values obtained from the classification can be used for the computation of the Self-Distance Matrix (SDM) and the novelty score. The SDM can be computed on the features directly or by transforming them to their posterior probability values. This transformation was seen to be effective in the segmentation of Dhrupad vocal and instrumental alap audios [1], as seen in Figure 3.3, and hence can be investigated for use in Khayal vocal concerts as well. In the case of [1] the sections to be identified are fixed in number and occur sequentially, hence the labeling of the sections was not a major problem. They used unsupervised GMM classification to convert the features into posterior probability values, with the number of classes in the classifier equal to the number of sections. In the case of detecting sections in a Khayal vocal concert, though the number of section types is fixed, they do not occur sequentially and generally repeat over the concert in a way that cannot be generalized. Also, we finally need to label the sections, which will not be straightforward if we use unsupervised classification methods. Hence the preferable way in this case is to convert features into posteriors using a supervised classification algorithm and use the labels obtained for further post-processing.

An SDM is computed on the features / posteriors using D(i, j) = d(x_i, x_j) for i, j ∈ {1, 2, ..., N}, where the distance function d specifies the distance between two frames x_i and x_j. Typically used distance measures are the Euclidean distance or the element-wise dot product [12]. The SDM provides a visualization of how distinct the features are in different sections and of which sections repeat. As seen in Figure 3.3(b), 3 blocks corresponding to alap-jod-jhala are clearly seen in the SDM.

Points of high contrast in the SDM can be detected by convolution along the diagonal with a checkerboard kernel [12]. The dimensions of the kernel have to be decided depending on the time scale of the section to be detected. This convolution yields a one-dimensional plot called the novelty score, whose peaks indicate the contrasting points which, given suitable feature selection, may lie at the locations of section boundaries. The peaks in the novelty score might be closely spaced and spurious. Peak detection is therefore done in the peak detection block using the 'local peak local neighbourhood' search proposed in [7]. Even after this processing, the novelty score can still have multiple peaks, resulting in over-segmentation. At this point we know the labels, and hence we can name the sections between two peaks. Multiple sections can be further merged in the post-processing block using rules derived from musicians' annotations, for example by observing the maximum duration between two sections of the same type that the musicians ignored when merging sections.
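The sketch below illustrates this chain: an SDM computed from frame vectors (acoustic features or classifier posteriors), a novelty score obtained by sliding a checkerboard kernel along the diagonal, and a simple local-maximum peak search. The kernel size and neighbourhood width are illustrative values, not the settings used in this thesis.

    import numpy as np
    from scipy.spatial.distance import cdist

    def checkerboard_kernel(L):
        """2L x 2L kernel: -1 over the within-section quadrants, +1 over
        the cross-section quadrants, so that boundaries in a distance
        matrix produce positive novelty peaks."""
        q = np.ones((L, L))
        return np.block([[-q, q], [q, -q]])

    def novelty_score(X, L=30):
        D = cdist(X, X, metric='euclidean')  # self-distance matrix
        K = checkerboard_kernel(L)
        Dpad = np.pad(D, L, mode='edge')
        nov = np.zeros(len(X))
        for i in range(len(X)):  # correlate the kernel along the diagonal
            nov[i] = np.sum(K * Dpad[i:i + 2 * L, i:i + 2 * L])
        return nov

    def pick_peaks(nov, half_win=20):
        """Keep samples that are maxima of their local neighbourhood, in
        the spirit of the 'local peak local neighbourhood' search of [7]."""
        peaks = []
        for i in range(len(nov)):
            lo, hi = max(0, i - half_win), min(len(nov), i + half_win + 1)
            if nov[i] > 0 and nov[i] == nov[lo:hi].max():
                peaks.append(i)
        return peaks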

Figure 3.3: SDM (top) and novelty functions (bottom) for a sitar concert computed by (a) acoustic feature vectors (b) posterior feature vectors, as depicted in [1]

The final peaks thus obtained represent the boundaries of the sections, and the regions between them will have labels in the case of supervised classification. In the case of the unsupervised classification proposed in [1], on the other hand, we would still have to determine the labels for the sections as well as detect repetitions of the sections. Since the features are musically derived, we can make use of the feature values along with the boundary points to come up with a rule based decision or a hierarchical classification as done by [3]. Another approach could be to use a Hidden Markov Model (HMM) to detect repetition, but at the level of sections rather than at the frame level [13]. While image processing based methods to detect repetitions have been reported in [6], the method in [3] seems attractive due to its simplicity.

Chapter 4

    Database Description

We have 102 concert audios of 24 artists from commercially available CDs, spanning a number of gharanas of vocal music. The audios were stored at a 16 kHz sampling rate with 16 bit resolution in single channel format. The total duration of the audios is 51 hours, with 66 ragas covered in the recordings. The longest concert audio is 1 hr 3 min and the shortest is 15 min, while the average audio duration is 30 min. The durations of the concert sections in the audios thus vary to a large extent. The number of recordings per artist and their gharanas can be seen in Table 4.1. As can be seen, many artists belong to more than one gharana. The database was chosen such that almost all the gharanas were covered. It was observed that all the artists in the database render akar taans. To keep the overall percentage of bol taan and sargam taan on par with the akar taan, more concerts of the artists Jasraj, Ajoy Chakraborty and Kishori Amonkar were included, as they were seen to take sargam and bol taans. In spite of this, the percentage of sargam taan over all the concerts is just 1.6%, while the percentage of akar taan is 13.3% and that of bol taan is 3%.

The audios were selected considering that they have all the major sections, i.e. the unmetered alap, Bada Khayal, Chota Khayal and their improvisation sections. These divisions within the Khayal performance are explained in detail in section 4.2. As discussed earlier, the artists may skip sections or choose the tempo of the performance depending on time constraints. Generally, the taan section marks the end of the Bada Khayal and the Chota Khayal in any performance, but an artist might skip the taan section after the Bada Khayal and render it directly at the end of the Chota Khayal. This was observed particularly in the case of shorter audios of 15-20 min duration.

4.1 Database Subsets for Evaluation

Detailed experimentation has been performed using different specialized subsets of the database. The subsets created and their purposes are described below, with a summary in Table 4.2.

Table 4.1: Distribution of clips per artist and the gharana of the artists

Artist                  No. of Clips   Gharana
Aarti Ankalikar         2              Agra, Gwalior, Atrauli
Ashwini Bhide           1              Jaipur-Atrauli
Ajoy Chakraborti        12             Patiala
Aslam Khan              4              Agra
Dattatreya Velankar     2              Gwalior, Kirana
Girija Devi             4              Banaras
Gauri Pathare           2              Jaipur, Gwalior, Kirana
Hirabai Barodekar       4              Kirana
Jasraj                  24             Mewati
Jitendra Abhisheki      1              Agra, Jaipur
Jayateerth Mevundi      1              Kirana
Jagdish Prasad          4              Patiala
Kishori Amonkar         9              Jaipur, Bhendi Bazar
Kaushiki Chakraborti    4              Patiala
Kaivalyakumar Gurav     2              Kirana
Kumar Mardur            2              Kirana
Manik Bhide             2              Jaipur-Atrauli
Mani Prasid             2              Kirana
Malini Rajurkar         3              Gwalior
Prabha Atre             10             Kirana
Prabhakar Karekar       2              Agra, Gwalior
Raghunandan Panshikar   1              Jaipur
Ulhas Kashalkar         2              Gwalior, Jaipur, Agra
Veena Sahastrabuddhe    2              Gwalior, Jaipur, Kirana

4.1.1 Database Subset of 32 Concerts

This subset contains an almost equal mix of male and female artists, and different ragas, drawn from the 102 concert database. The total duration of the 32 concerts is 17 hrs. This subset is needed for the evaluation of 3 tasks that must be performed before proceeding to the detection of taan:

i) Singing Voice Detection (SVD): Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. To extract the melody, we need to first identify the regions where the vocal melody is present. This can be done using the SVD features proposed by [11], which are considered state-of-the-art for Hindustani classical music. We need to evaluate the performance of the SVD algorithm and look into possible improvements. This subset has hence been annotated for Vocal and Instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, out of which the duration of the vocal regions is 12 hrs. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.

ii) Obtaining finer ground truth within the taan section (SAD evaluation): If we observe a taan episode, some non-taan movements also occur between the taan movements, as seen in Figure 5.3. For the initial evaluation, we do not want the classifier accuracy to be affected by these non-taan regions occurring within the taan section. In order to facilitate this, the finer frame-level markings within the taan section were obtained automatically using the Speech Activity Detector (SAD) [14]. The evaluation of the SAD was done on this subset.

iii) Inspection of tempo invariance of features: As described in the previous chapter, the tempo keeps increasing gradually over the duration of the concert. We want to inspect whether the proposed features are invariant to this tempo variation. Generally an abrupt change in tempo is seen at the start of the Chota Khayal section, so the timing of this starting point can be used for this purpose. This is addressed in detail in section 6.3.

4.1.2 Database Subset of 96 Concerts

Among the 102 concerts, 32 were evaluated for SVD accuracy. Nonetheless, the taan sections in all the concerts were checked to verify that the pitch was not missing for major parts of the taan sections. For 6 of the 102 concerts it was observed that the pitch contour was absent even though the vocal melody was present in the audio. A detailed analysis of this is presented in section 5.1.

4.1.3 Database Subset of 22 Truncated Jasraj Concerts

From the 96 concert database, 22 concerts belonging to the artist Jasraj were used initially to observe the performance of the supervised classification algorithm for the detection of the taan section. The algorithm was subsequently tested on the larger database of 96 concerts. This subset was created to ease debugging of the various steps involved in the detection of sections and to obtain the optimal operating point to be used further on the larger database.

4.1.4 Musicians' Annotation: 24 Concert Subset

The end goal of this work is to mark taan episodes that are meaningful to musicians. It is important to obtain the taan section markings from different musicians and compare the differences to arrive at the logic behind their annotation. The differences were mainly in the taan episode markings, in terms of the allowable vocal non-taan regions between consecutive taan sections or the instrumental improvisation between the taan sections. This can be quantified and used to obtain a simple heuristic, to achieve the highest level of grouping of taan sections, by examining the region of audio separating every two detected taan segments in the musicians' annotations. For this purpose one concert of each artist was chosen to put together this subset of 24 concerts. Taan boundary markings were obtained from 3 musicians, and observations were carried out on them. In general, among all the markings, it was observed that the mukhada occurring at the end of a rhythmic cycle was combined into the taan episode. At most 10 sec of vocal duration corresponding to non-taan was combined into a taan episode. Also, instrumental improvisation occurring between rhythmic cycles was considered part of the taan if its duration was less than 50 sec. These insights were used to combine consecutive taan segments.
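A minimal sketch of such a grouping rule, using the 10 sec vocal and 50 sec instrumental gap limits observed above; segments are assumed to be sorted (start, end) times in seconds, and is_vocal_gap is a hypothetical helper that decides whether the gap between two segments is predominantly vocal:

    def group_taans(segments, is_vocal_gap,
                    vocal_gap_max=10.0, instr_gap_max=50.0):
        """Merge consecutive detected taan segments separated by short gaps."""
        merged = [list(segments[0])]
        for start, end in segments[1:]:
            gap = start - merged[-1][1]
            limit = (vocal_gap_max if is_vocal_gap(merged[-1][1], start)
                     else instr_gap_max)
            if gap <= limit:
                merged[-1][1] = end  # absorb the gap into the episode
            else:
                merged.append([start, end])
        return [tuple(seg) for seg in merged]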

Table 4.2: Summary of database subsets

Subset A (32 concerts): SVD and SAD evaluation; proving tempo invariance of features; comparison of taan detection performance using different pitch detection algorithms.
Subset B (22 Jasraj concerts for testing + 35 other male artists' concerts for training): truncated concerts used for finding the f-measure; initial testing of the system.
Subset C (24 concerts, 1 concert per artist): musicians' annotation for obtaining grouping heuristics.
Subset D (96 concerts): evaluation over the entire dataset; comparison of the proposed method with [1].

4.2 Khayal Concert Details

The Khayal concert is the most popular form in Hindustani classical vocal music and is based on the theme of a raga. A raga is not just a scale of allowed notes but also has motifs associated with it. The vocal artist uses the raga as the theme of the performance and is accompanied by tabla for rhythm, harmonium or sarangi for melodic accompaniment, and tanpura for keeping a reference to the tonic. The role of the melodic accompaniment is to follow the main melody, while the role of the tabla is timekeeping. The vocal artist is the main performer of the concert and has the liberty to give a few minutes of the concert to the tabla and harmonium artists for showing their skills at improvisation.

4.2.1 Khayal Concert Structure

The aim of Khayal is to elaborate on the idea of the raga, via motifs and note-by-note elaboration, as perceived by the performing artist. In a typical Khayal vocal concert the artist chooses a raga and starts introducing it through motifs in the alap. This introductory alap is not accompanied by tabla, i.e. the percussive accompaniment of the concert. The alap lasts at least half a minute and may be rendered for more than 10 min, depending on the gharana, i.e. the school of the artist. This is followed by the Bada Khayal composition and its improvisation section, where the tabla sets in with a particular tala in a slow (vilambit laya) or medium (madhya laya) tempo, as suitable for the bandish, i.e. the composition selected by the artist. The bandish generally comprises four lines of poetry. At the start of the Bada Khayal the artist renders the first 2 lines, called the sthayi, of the composition once or twice. The sthayi is limited to the middle and lower register [15].

After this the artist starts the improvisation of the Bada Khayal by first taking the alap, which can be rendered using the lyrics of the composition or the vowel /a/. The next two lines of the composition, called the antara, have their melody in the second part of the middle octave and higher [15]. Hence, depending on the raga elaboration followed by the artist, it can be taken when the artist is near those notes. After the alap the artist plays with the rhythm and melody in the baat section, which can again be rendered using the lyrics of the composition (bol), note names (sargam) or the vowel /a/ (akar). This section may be present or absent depending on the artist. After baat follows the taan section, which can be rendered using the lyrics, note names or the vowel /a/. The rendering of the taan section marks the end of the Bada Khayal improvisation. An artist may sometimes prefer not to take a taan section in the Bada Khayal improvisation, but that is rare. The concert then takes on a faster tempo relative to the Bada Khayal by moving on to the Chota Khayal composition and its improvisation sections, which are the same as for the Bada Khayal, i.e. the alap, baat and taan. Often, since the rhythm has entered a fast tempo, the artists prefer to skip the alap and baat sections after rendering the Chota Khayal composition and take the taan directly. Many a time the artists prefer to take a tarana instead of a Chota Khayal, but the improvisation sections remain the same as those of the Chota Khayal. The tarana has no apparently meaningful lyrics and uses words like 'dirdir', 'tanana', 'dim', 'tom', etc., as well as the bols of the tabla. The individual acoustic and musical characteristics of the various Khayal vocal concert sections are given in Table 4.3 and their hierarchy is depicted in Figure 4.3. The sequence of different sections over an entire concert of 27 min can be seen in Figure 4.1.

Figure 4.1: Approximate sequence of different sections in a Khayal vocal performance of raga Shree by artist Kumar Mardur

Figure 4.2: Rhythmogram of tabla onsets in a raga Deshkar Khayal vocal performance by artist Kishori Amonkar, as depicted in [2]

Figure 4.2 shows a rhythmogram, which is a two dimensional time-pulse representation with lag-time on the y-axis, time position on the x-axis and the auto-correlation values of onsets visualized as intensity. The auto-correlation peaks give us an idea of the tempo in the concert: as the peaks get closer, it can be interpreted that the tempo is increasing. Throughout the concert, the tempo is seen to increase gradually within the Bada Khayal, with an abrupt increase in tempo (possibly also a change in tala and raga) for the Chota Khayal, as seen in Figure 4.2. The tempo of the Bada Khayal is not fixed to a particular value across concerts; many artists take tempi from as slow as 10 bpm to as fast as 40 bpm, which might seem like madhya laya [16], while the drut laya ranges from 160-320 bpm. The improvisation sections may not follow any particular order, but the above-mentioned sequence is a general trend. The melody improvisation is gradual and note by note, with the 'mukhada' marking the end of a rhythmic cycle. The artist generally starts improvising in the lower octave, then the middle octave, and then advances to the upper octave. Melodic movements do not span multiple octaves within a rhythmic cycle, with the exception of the taan section, where the artist tries to show off his mastery over the voice. According to [17] the various sections do not fall into rigid divisions, and there might be occasional overlap between them. Since the sections differ in proportion, placement and quality of impact according to the musical context, one section cannot be mistaken for another.
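For illustration, a rhythmogram of the kind shown in Figure 4.2 can be computed as the short-time autocorrelation of an onset-strength envelope. The librosa onset detector below is a generic stand-in for the band-limited tabla onset detection of [2], and the window and lag settings are illustrative.

    import numpy as np
    import librosa

    def rhythmogram(path, win_sec=8.0, hop_sec=1.0, max_lag_sec=2.0):
        y, sr = librosa.load(path, sr=16000)
        env = librosa.onset.onset_strength(y=y, sr=sr)  # onset envelope
        fps = sr / 512.0  # envelope frame rate (default hop of 512 samples)
        win, hop, lags = (int(x * fps) for x in
                          (win_sec, hop_sec, max_lag_sec))
        cols = []
        for start in range(0, len(env) - win, hop):
            w = env[start:start + win] - env[start:start + win].mean()
            ac = np.correlate(w, w, mode='full')[win - 1:win - 1 + lags]
            cols.append(ac / (ac[0] + 1e-9))  # normalize by the lag-0 value
        return np.array(cols).T  # lag x time; plot as an intensity image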

    4.2.2 Taan Section

In this work, our focus is on segmenting taan sections that are melodically salient, i.e. where the sequence of melodic phrases or notes is rendered in a characteristic melodic style. The notes may be articulated in various ways, including solfege (sargam taan) and the syllables of the lyrics (bol taan). Most common, however, is the akar taan, rendered using only the vowel /a/ (i.e., as melisma). The sequence of notes is relatively fast-paced and regular, produced as skillfully controlled pitch and energy modulations of the singer's voice, similar to vibrato. But unlike vibrato, which ornaments a single pitch position in Western music, the cascading notes of the taan sketch elaborate melodic contours like ascents and descents over several semitones. The melodic structure is strictly within the raga grammar, while the step-like regularity in timing brings a rhythmic element to the improvisation, in contrast to the (also improvised) alap sections. Apart from showcasing the singer's musical skills, one or more taan sections typically contribute to the climax of a raga performance and therefore serve as prominent musicological markers. These unique characteristics of the taan section motivate us to investigate its detection further.

Figure 4.3: Various sections in a Khayal concert are depicted, with the sequence being the alap without percussion followed by the Bada Khayal and its improvisation component

Table 4.3 provides the musical and acoustic characteristics of the taan section. All concerts may not have all the taan types. It was seen that the akar taan is rendered by artists across all the schools of music, while the least rendered type was the sargam taan. The minimum percentage of taan within a concert, relative to the total concert duration, was found to be 3.9%, while the maximum was 38%; on average, 18% of a concert is taan. The minimum duration of a taan section is 5.6 sec across all the concerts. The database subsets used for the task of taan detection and for evaluating the pre-processing steps were described in section 4.1; the annotation methodology is described in the next section.
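As an illustration of one acoustic correlate listed in Table 4.3, the sketch below measures the oscillation rate of a mean-subtracted contour (vocal pitch or energy) as zero crossings per second; the contour and its frame rate fps are assumed inputs from the pre-processing stage.

    import numpy as np

    def oscillation_rate(contour, fps, win_sec=1.0):
        """Approximate oscillation rate (Hz) from zero crossings of the
        locally mean-subtracted contour; taan regions are expected to
        show high, regular rates."""
        win = int(win_sec * fps)
        rates = []
        for start in range(0, len(contour) - win + 1, win):
            w = contour[start:start + win]
            w = w - w.mean()  # remove the local trend
            zc = np.sum(np.abs(np.diff(np.sign(w))) > 0)
            rates.append(zc / (2.0 * win_sec))  # two crossings per cycle
        return np.array(rates)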

    4.3 Annotation Methodology

Keeping in mind the musical characteristics mentioned in Table 4.3, as well as the acoustic characteristics, we can mark the section in Praat [18], giving us the time instances of the start and end of the section. Before starting the annotation, the need for annotating the taan section was explained to the musicians, i.e. easy navigation within a concert to go to a particular section. The musicians were asked to annotate the taans as though they were pointing out to their students the locations of taan episodes for their study of taan. Taan-like movements might occur in other sections for short intervals as well, but do not form a taan section. Also, the taan section can occur multiple times, and the artist has the liberty to render it anywhere in the concert. Multiple passes near a boundary were thus allowed before finalizing it as a valid boundary. The rhythmic cycle is important here, as the tempo is slow in the initial part of the concert and fast in the later part. If the majority of a rhythmic cycle contains taan-like melodic movements which are followed by more such movements over subsequent cycles, then it was marked as a taan section. A taan section thus comprises a cluster of taan-like movements occurring over multiple cycles. The section end is marked when a different homogeneous section other than the taan episode starts.

Table 4.3: Khayal concert sections with musical characteristics and acoustic correlates

Unmetered Alap
Musical characteristics: Tabla: (a) absent; an unmetered, non-pulsated section. Vocal melody: (b) introduction of the raga, a slow rendition of raga motifs with more steady notes. (c) Depending on the nature of the raga, the artist generally starts from the middle octave Sa, then moves to the lower octave, back to the middle octave, on to the higher octave, and ends again on the middle octave Sa. (d) Rendered only with the vowel /a/.
Acoustic characteristics: (a) The wideband events of tabla strokes are absent, as no percussion is present; only the voice of the lead artist and the accompaniment are heard. (b) Long steady pitches at frequencies corresponding to the note locations of the raga. (c) The evolution of pitch values over the alap can be observed. (d) Harmonics at the formant locations corresponding to the vowel /a/ appear dark.
Features: (a) Absence of tempo in the rhythmogram of tabla onsets. (b) A steady-note measure on the pitch values if the rhythmogram is not sufficient. (c) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time. (d) Spectral centroid.

Bada Khayal
Musical characteristics: Tabla: (a) percussion sets in. Vocal melody: (b) distinctive features of the raga are displayed through a composition having two parts, viz. sthayi and antara. (c) Usually the first line of the sthayi, called the 'mukhada', serves as a recurring theme in the performance and gives a cue for the 1st beat of the rhythm cycle. (d) Sthayi: melody in the 1st part of the middle octave and part of the lower octave. Antara: melody from the 2nd part of the middle octave to the upper octave and beyond. Tempo: slow or medium.
Acoustic characteristics: (a) Wideband events of tabla strokes at regular intervals seen in the spectrogram. (b) Generally a change of pitch at the beats; long held pitches due to the slow tempo. Lyrics comprising vowels and consonants break a note being rendered continuously; formant locations change with the vowels. (c) The 'mukhada' melodic phrase is generally less variable in the pitches used and can be identified to get the start of a cycle. (d) Evolution of pitch values in the alap section.
Features: (a) The total section of the Bada Khayal and its improvisation can be separated using a rhythmogram of tabla onsets. (b) The bandish section alone can be separated using the cue of lyrics, via the spectral centroid of the source-isolated spectrum, as it comes immediately before the alap with tabla and immediately after the alap without tabla. (c) 'Mukhada' identification [19] can be used. (d) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time.

Chota Khayal
Musical characteristics: Tabla: (a) the tempo is fast (in comparison with the Bada Khayal). Vocal melody: (b) similar to the Bada Khayal, a composition with sthayi and antara; a tarana may also be taken, which uses syllables like ta, na, de, re, dim, or even tabla bols, instead of lyrics.
Acoustic characteristics: (a) The interval between tabla strokes is smaller than in the Bada Khayal. (b) Pitch is not held for long durations (relative to the Bada Khayal); increased pitch ornamentation compared to the Bada Khayal.
Features: (a) A rhythmogram of tabla onsets can separate the Chota from the Bada Khayal. (b) Spectral centroid and its delta, as consonants occur in quicker succession than in the Bada Khayal, where vowels are stretched over long notes.

Alap / Vistar (Akar / Bol)
Musical characteristics: Tabla: (a) slow tempo; more percussive fillers. Vocal melody: (b) slow elaboration of the raga through a sequence of melodic phrases with emphasis on resting notes. (c) The melody mainly remains in the 1st part of the middle octave. (d) Elaboration done using either the vowel /a/ (akar) or the lyrics of the composition (bol); emphasis on melodic patterns and variation of motifs.
Acoustic characteristics: (a) Wideband events of percussive strokes visible at more widely separated instances than in the Chota Khayal; fillers present. (b) Constant pitch held for a long time. (c) Pitch remains in the middle octave. (d) Formants change as the lyrics are uttered.
Features: (a) A rhythmogram of tabla onsets will not distinguish this section from the others in the Bada Khayal improvisation. (b) A steady-note measure to distinguish it from the baat and taan sections. (c) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time. (d) Spectral centroid to distinguish bol from akar.

Baat / Layakari (Sargam / Bol)
Musical characteristics: Tabla: (a) may mimic the patterns of the singer or play normally, depending on context. Vocal melody: (b) rhythmic improvisations using the names of the notes (sargam) or the lyrics of the composition (bol); stress is given at the beats by making note or lyric changes there. The speed of the bol/sargam can be the same as the tempo, or twice it, or any multiple, but not as fast as in taan; playing with notes and lyrics with emphasis on rhythm. (c) No long held notes.
Acoustic characteristics: (a) Consonant breaks at regular intervals in the spectrogram; note transitions at the percussive hits. (b) Pitch modulation, if any, moderate compared to taan; no long held pitch.
Features: (a) A rhythmogram of vocal onsets will be useful after source isolation; spectral centroid after source isolation. (b) Pitch modulation captured via the energy ratio between the frequency ranges corresponding to the oscillations here.

Taan (Akar / Bol / Sargam)
Musical characteristics: Tabla: (a) the tempo might get faster, with the basic tala being played without fillers. Vocal melody: (b) rapid gamak-like movements taken using the vowel /a/ (akar), the lyrics of the composition (bol) or the names of the notes (sargam). (c) The patterns range over all 3 octaves, and the section serves as a climax of the raga presentation; emphasis on vocal skills. This is the most distinctive section of the Khayal rendition.
Acoustic characteristics: (a) Wideband events of percussive strokes at regular but shorter intervals (with no fillers). (b) Rapid pitch oscillations and rapid energy fluctuations. For akar, the formants corresponding to the vowel /a/ look dark; in the case of bol the formants change gradually, while in sargam they change rapidly.
Features: (a) A rhythmogram cannot distinguish this section from the others. (b) Frequency of the oscillatory pitch; rate of zero crossings of the mean-subtracted energy contour; spectral centroid for formant-change detection in bol and sargam.

Chapter 5

    Pre-processing

Before extracting the features, we need to extract the pitch corresponding to the vocal melody. We approach this via slight modifications to the state-of-the-art singing voice detection (SVD) algorithm [11]. Another required step is to obtain a finer ground truth for calculating the frame-wise accuracy of taan detection, as described in section 4.1.1. Each of these tasks is described in detail below.

    5.1 Singing Voice Detection (SVD)

Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. An important step in extracting the melody is the detection of the regions where the singing voice is present, so that the melody is extracted only in those regions, as explained in section 3.1.1. The data Subset A has been annotated with Vocal and Instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, of which 12 hrs are vocal regions. Any pause the artist takes for breath is ignored and included in the vocal region; breath pauses are those whose duration is perceptually insignificant (roughly less than 80 ms), as noted in [23]. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.

The pitch of the dominant source is first extracted using the technique of [4]. The assumption is that the vocal melody, being in the lead, has the strongest contribution to the spectrum; thus, by extracting the predominant F0, we expect to obtain the melody corresponding to the voice wherever both voice and accompaniment are present. The F0 is further used to isolate the dominant source spectrum by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained by searching in the vicinity of multiples of the location of the detected pitch.

Features are extracted from the source-isolated spectral envelopes, as these are less dependent on the pitch of the source and represent the source timbre better. The static timbral, dynamic timbral and dynamic F0-harmonic feature categories were proposed in [11]. There, the authors experimented with just 13 mins of Hindustani classical audio and used a supervised GMM in leave-one-song-out validation with four mixtures per class, giving an accuracy of 84%. In this work, we combine the feature categories and apply feature selection using the WEKA toolbox [20]; in [11], feature selection was applied within each feature category and supervised classification was carried out. We obtain 11 selected features from among all the categories (85 features in total) using the Best First search method in the WEKA toolbox [20]. We explore the possibility of using an unsupervised classification algorithm, since the nature of the acoustic characteristics is clearly different for the voice and instrument regions within an audio. A particular audio will always contain tabla for rhythmic accompaniment and tanpura for the tonic reference, but the melodic accompaniment might be sarangi or harmonium. An artist might also choose to use a swaramandal so that different reference notes can be played. In rare cases, for the Chota Khayal section, the artist might choose to use both tabla and mridangam for rhythmic accompaniment. For these reasons, it is likely better to make a within-clip decision for the Vocal and Instrumental marking.

The unsupervised classification algorithm k-means clustering is applied, and labels are obtained for the frames as belonging to either the voice or the instrumental regions by using one of the dominant features to make the decision. From here we can extract the pitch corresponding to the vocal regions alone and pass it to the feature extraction block. Singing Voice Detection (SVD) using k-means clustering and genre-specific features was seen to give 88.30% accuracy when evaluated on the data Subset A. The SVD accuracy within the taan regions was also evaluated, to establish that SVD-related errors do not affect the taan episode detection; the frame-wise accuracy within the taan regions came out to be 89.20% for the data Subset A. The labeling of the clusters obtained after k-means is of major concern. The labeling is done using one of the features with the highest individual classification accuracy, namely the normalized harmonic energy.
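As an illustration, the following is a minimal sketch of this within-clip clustering step, assuming the 11 selected features have already been computed per frame and that the column holding the normalized harmonic energy is known; the function and argument names are illustrative, not the thesis code.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_svd(features, harmonic_energy_col=0):
        """Two-cluster k-means over the selected per-frame features; the
        cluster with the higher mean normalized harmonic energy is labeled
        Vocal ('V'), the other Instrumental ('I')."""
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
        means = [features[km.labels_ == c, harmonic_energy_col].mean()
                 for c in (0, 1)]
        vocal = int(np.argmax(means))
        return np.where(km.labels_ == vocal, "V", "I")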

Figure 5.1: Algorithm-marked Vocal (V) and Instrumental (I) boundaries for the audio concert with the least accuracy. The highlighted “I” marking should have been “V”.

The accuracy, obtained by comparing the frame-wise labels, came out to be 88.30%, which makes our SVD algorithm reliable for the subsequent extraction of pitch in the vocal regions alone. Analysis of the two clips where the accuracy was below 70% shows that their spectrograms were washed out, i.e. the vocal harmonics were not clear, and in a few instances the background accompaniment was loud. As can be seen in Figure 5.1, the highlighted region does not look clear even though the voice of the lead vocal artist is present and audible in that region of the audio. Tanpura suppression might help in cases where a loud tanpura affects the SVD decision, since long steady notes get marked as belonging to the Instrumental category.


Though the accuracy of SVD was seen to be high, we inspected all the taan regions in the total database of 102 concerts. Six clips in particular had problems in the SVD decision within the taan region, apart from the two error clips analyzed in the data Subset A. In the 6 concerts that were eliminated from the 102 concerts for taan detection to create the data Subset D, background vocals were present. There was also an accompanying sarangi, which has a frequency range similar to the human voice and, unlike the harmonium, is played in a continuous fashion. This creates confusion in the Vocal and Instrumental decisions made by the SVD algorithm. Figure 5.2 illustrates one such case. Though the sarangi is played at a pitch higher than the voice, the SVD cannot give more weight to the pitch feature, as all the features have equal weights.

Figure 5.2: Spectrogram where, along with the lead artist, there is an accompanying instrument, the sarangi, whose frequency range is similar to that of the human voice.

The frame-level accuracy of SVD was also calculated using supervised GMM classification in leave-one-song-out validation, as proposed in [11]. Four Gaussians were used to model each class (Vocal and Instrumental), and the accuracy was found to be 80.10%, which is less than that obtained using k-means clustering (88.30%).

5.2 Obtaining Finer Ground Truth within the Taan Section (SAD Method)

We observed that the musicians labeled taan based on the perceived intent of the performer, i.e. relatively short stretches of instrumentals and other vocal styles occurring between taan episodes were subsumed under the taan label (as in Figure 5.3). For the real-world use case, we would like our automatic system to match the musicians' labeling of the taan sections in the concert. At the same time, for the initial evaluation, we do not want the classifier accuracy to be affected by the non-taan regions falling inside the taan sections, as seen in Figure 5.3. To facilitate this, the finer frame-level markings within the taan sections were obtained automatically using a Speech Activity Detector (SAD). The SAD method is completely unsupervised and is based on the speaker diarization system of [14].

The SAD was provided with our carefully derived pitch- and energy-based features. It is an iterative classification process in which separate Gaussian mixture models (GMMs) are fitted to the frames classified as speech and non-speech (in our case, taan and non-taan). Classification is then performed on all the frames again using these models. The process repeats, stopping either at convergence or after a maximum number of iterations.
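The following is a minimal sketch of such an iterative refinement, assuming pre-computed features and an initial taan/non-taan guess; the mixture count, the iteration cap, and the prior-free likelihood comparison are assumptions rather than details of the system in [14].

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def iterative_gmm_sad(feats, init_labels, n_components=4, max_iter=10):
        """Iteratively refit one GMM per class (taan / non-taan) and
        reclassify every frame until the labels stop changing.

        feats: (n_frames, n_dims) pitch- and energy-based features.
        init_labels: boolean array, initial taan (True) / non-taan guess.
        """
        labels = init_labels.copy()
        for _ in range(max_iter):
            gmm_taan = GaussianMixture(n_components=n_components,
                                       random_state=0).fit(feats[labels])
            gmm_rest = GaussianMixture(n_components=n_components,
                                       random_state=0).fit(feats[~labels])
            # Reclassify all frames by comparing per-class log-likelihoods
            new = gmm_taan.score_samples(feats) > gmm_rest.score_samples(feats)
            if np.array_equal(new, labels):      # converged
                break
            labels = new
        return labels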

Chapter 6

    Taan Section Detection

The taan section is important to identify because it hints at the end of a Bada Khayal or Chota Khayal improvisation part. The typical characteristic of the taan section is rapid oscillatory pitch fluctuation, as seen in Figure 6.1. Further information about the types of taan, and the need for pitch-based features for its detection, is given in Table 4.3. The movement is similar to vibrato in Western music, but with a slightly different rate of frequency modulation, and with oscillations that are not limited to a single note. These oscillations occur irrespective of whether the taan is rendered in the Bada Khayal improvisation section or in the Chota Khayal. While the acoustic characteristics of the other sections may change owing to the gradual increase in tempo over the entire duration of the concert, the taan rate typically lies in the range of 5-10 Hz across artists, irrespective of the underlying tempo changes.

A spectrogram of a typical akar taan can be seen in Figure 6.1, where the x axis is time and the y axis is frequency in Hz. Across various concerts, the pitch oscillation frequency was observed to lie between 5 and 10 Hz. According to the studies in [21], in a pedagogical scenario the rate of taan oscillations ranges from 1.65 to 3.14 Hz, but we are interested in the actual performance scenario, where this rate is considerably higher. The vocal melody extracted from a concert shows the rapid pitch oscillations of the taan section, which starts after 1196 sec in Figure 6.2 (a), in contrast to the steady notes and slow ornamentation before 1196 sec. As can be seen in Figure 6.2 (a), around 6 oscillations occur in the one-second interval from 1198 sec to 1199 sec.

Figure 6.1: Spectrogram of a part (7 sec) of a longer akar taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the vowel /a/. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.


Figure 6.2: Performance of raga Shree by Kumar Mardur, where the taan section begins at 1200 sec, with (a) pitch variations and (b) energy variations corresponding to the voice alone.

Figure 6.3: Spectrogram of a part of a sargam taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the phones of note names like Pa, Ni, Sa, Ga, etc. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.

Another characteristic is that the energy in the taan section fluctuates more rapidly than in the other sections, as seen in Figure 6.2 (b).

While rendering an akar taan, the artist generally sticks to the vowel /a/, and the formant locations corresponding to it can be seen as dark lines in Figure 6.1. In the spectrogram of a sargam taan in Figure 6.3, considerable formant movement can be seen as the singer utters the swaras (solfege) while singing the oscillatory melodic movements. Many 'breaks' can thus be seen in the melody line, corresponding to the consonants being uttered. The bol taan section in Figure 6.4 also shows formant movement, but not as rapid as in the sargam taan, because the artist utters the consonants at larger time intervals than in the sargam case and holds the vowels of the lyrics for longer durations.


Figure 6.4: Spectrogram of a part of a bol taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the phones of the lyrics 'Tuma bina kaun'. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.

    6.1 Pre-processing / Melody Extraction

Vocal pitch is the only cue for distinguishing the taan section; therefore, we need to extract it reliably before computing features on it. The SVD algorithm's accuracy is 88.30%, as seen in section 4.1.1(i), which makes the Vocal and Instrumental decisions very reliable for the case of Hindustani music. The pitch of the Vocal regions marked by the SVD algorithm is extracted using the technique of [4], henceforth referred to as PolyPDA in this thesis.

    6.2 Feature Extraction

We aim to capture the acoustic characteristics of the taan section, identified by observing spectrograms of a number of clips. As per our observations illustrated in Figure 6.2, both the pitch oscillations and the variations in the energy contour can be exploited to detect the taan sections among all the sections: regular pitch oscillations appear in the taan section, as opposed to the non-taan sections, and the fluctuations in the energy contour are high in the taan section. The pitch values are calculated at 10 ms intervals after pitch extraction from the polyphonic audio, corresponding to the voice alone. Features are first calculated over short analysis frames and then averaged over larger texture windows. This helps to eliminate taan-like movements that might occur spuriously in other sections and to obtain the average overall behaviour. For our study, we consider a texture window of 5 sec, which is slightly less than the minimum duration of a taan section, with a texture-frame hop of 1 sec. The analysis frames are 1 sec long with a 0.5 sec hop, which was also seen to be effective in [22]. As can be seen from Figure 6.5, averaging over the texture windows is essential for avoiding spurious feature values when a taan-like movement occurs in a non-taan section. The pitch contour is first interpolated across silences shorter than 80 ms to avoid breaks in the pitch due to consonants and breath pauses [23]. The feature values are normalized to zero mean and unit variance across the concert. Already from the unnormalized feature values in Figure 6.5 it can be seen that the distinction between the taan and non-taan regions is quite clear.
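For concreteness, the two-level windowing can be sketched as below, operating on an already-computed analysis-frame feature sequence (2 values per second at the 0.5 sec hop); the function name and the 1-D input are illustrative assumptions.

    import numpy as np

    def texture_average(frame_feats, win_sec=5, hop_sec=1, frames_per_sec=2):
        """Average 1-sec/0.5-sec-hop analysis-frame features over 5-sec
        texture windows hopped by 1 sec."""
        w = win_sec * frames_per_sec          # texture window in analysis frames
        h = hop_sec * frames_per_sec          # texture hop in analysis frames
        return np.array([np.mean(frame_feats[i:i + w])
                         for i in range(0, len(frame_feats) - w + 1, h)])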


Figure 6.5: Values of the Energy Fluctuation Rate taan feature for an entire concert of raga Shree performed by artist Kumar Mardur, with the ground-truth taan sections shown as red boxes. Features are plotted with (a) an analysis window of 1 sec and hop of 0.5 sec (no texture window applied) and (b) a texture window of 5 sec with a hop of 1 sec. The distinction of the taan-section feature values is more evident in (b) than in (a).

    6.2.1 Pitch Based Features

The DFT spectrum (128 point) of 1 sec segments of the pitch contour (1 sec = 100 pitch values at 10 ms sampling) is computed using a sliding analysis frame of 1 sec with a hop size of 500 ms. From this DFT, two features are extracted: the Energy Around the Maximum Amplitude in the DFT (EAMA) and the Frequency Corresponding to the Maximum Amplitude in the DFT (FCMA). For the EAMA feature, the 2 bins before and after the maximum-amplitude bin are included (a span of approximately 3.9 Hz), to overcome the bin resolution limitation. With the Maximum Amplitude Value in the DFT (MAV) of an analysis frame given by eq. 6.1, FCMA and EAMA are defined in eq. 6.2 and eq. 6.3 respectively:

    MAV = \max_k |Z(k)|^2                                      (6.1)

    FCMA = f_{MAV}                                             (6.2)

    EAMA = \sum_{k = k_{MAV} - 2}^{k_{MAV} + 2} |Z(k)|^2       (6.3)

where Z(k) is the DFT of the mean-subtracted pitch trajectory z(n), with samples at 10 ms intervals, f_{MAV} is the frequency at which the maximum occurs, and k_{MAV} is the frequency bin closest to f_{MAV} Hz.
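A sketch of eqs. 6.1-6.3 on one analysis frame of the pitch contour follows; the use of a real FFT and the exclusion of the DC bin (which is near zero anyway after mean subtraction) are implementation assumptions.

    import numpy as np

    def pitch_dft_features(pitch_frame, fs=100.0, nfft=128):
        """FCMA and EAMA (eqs. 6.1-6.3) for one 1-sec analysis frame of
        the pitch contour (100 values at a 10 ms hop, i.e. fs = 100 Hz)."""
        z = pitch_frame - np.mean(pitch_frame)   # mean-subtracted trajectory z(n)
        power = np.abs(np.fft.rfft(z, nfft)) ** 2
        k_mav = int(np.argmax(power[1:])) + 1    # peak bin, DC excluded
        fcma = k_mav * fs / nfft                 # eq. 6.2: frequency of the peak
        lo, hi = max(k_mav - 2, 0), min(k_mav + 2, len(power) - 1)
        eama = float(np.sum(power[lo:hi + 1]))   # eq. 6.3: +/-2 bins around peak
        return fcma, eama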


Figure 6.6: (a) Taan frame values for 2 features, plotted as black circles for Bada Khayal and blue crosses for Chota Khayal; (b) Gaussians plotted using the means and variances of the taan features in Bada Khayal and Chota Khayal to visualize their overlap.

    6.2.2 Energy Fluctuation Rate

The energy values corresponding to the vocal melody alone are also available at every 10 ms, along with the pitch. To capture the variations in the energy contour seen in Figure 6.2(b), the mean value is first subtracted from the energy contour within a 1 sec analysis frame, and the number of zero crossings detected is then used as the feature value for that frame. Here as well, we use a hop of 500 ms for the 1 sec analysis frame and average these values over texture frames of 5 sec with a 1 sec hop.
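The zero-crossing count at the analysis-frame level can be sketched as below (texture averaging as above applied afterwards); the names are illustrative.

    import numpy as np

    def energy_fluctuation_rate(energy, frame=100, hop=50):
        """Zero-crossing count of the mean-subtracted energy contour per
        1-sec analysis frame (100 samples at 10 ms), hopped by 0.5 sec."""
        rates = []
        for start in range(0, len(energy) - frame + 1, hop):
            seg = energy[start:start + frame]
            seg = seg - np.mean(seg)                           # remove the mean
            rates.append(int(np.sum(seg[:-1] * seg[1:] < 0)))  # sign changes
        return np.array(rates)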

    6.3 Inspection of Variability of Features with Tempo

As described in section 4.1.1(iii), the tempo increases gradually over the duration of the concert. We want to verify that the proposed features are invariant to this tempo variation, using the data Subset A. Generally, an abrupt change in tempo is seen at the start of the Chota Khayal section, so the starting time of the Chota Khayal can be used for this purpose. Figure 6.6(a) shows the feature values plotted across the data Subset A, with high overlap between the taan feature values of the Bada Khayal and Chota Khayal sections. The mean values of the Bada Khayal and Chota Khayal taan features are -0.0302, -0.0458 and 0.0761, 0.1151 respectively for the scatter plot over the 2 features, while the variances are 0.0325, 0.0291 and 0.0507, 0.0487 respectively. These feature values are close, and their overlap can be visualized in Figure 6.6(b) with the help of Gaussian contours plotted using the above means and variances.

The Euclidean distance between the Bada Khayal and Chota Khayal taan features was calculated, as were the Euclidean distances between taan features within the Bada Khayal section and within the Chota Khayal section. A histogram of these distance values is plotted in Figure 6.7. The mean (0.6) of the distances between the Bada Khayal and Chota Khayal taan feature vectors is comparable with the means of the within-Bada Khayal and within-Chota Khayal distances (0.49 and 0.59 respectively). Thus, both quantitatively and from the figure, it is evident that the feature values are tempo independent.


Figure 6.7: Histogram of Euclidean distances, showing that the distances between feature values from the lower-tempo Bada Khayal section and the higher-tempo Chota Khayal section are comparable with the distances within Bada Khayal and those within Chota Khayal.

    6.4 Classification and Grouping using Posteriors

A frame-wise classification into taan and non-taan styles is carried out for all frames in the vocal segments by a trained MLP network. We use a feed-forward architecture with a sigmoid activation function for the hidden layer, which comprises 300 neurons. Training uses cross-entropy error minimization via the error back-propagation algorithm. We compare the frame-level accuracy of the MLP with an SVM classifier on the data Subset B using leave-one-song-out validation. As seen in Table 6.1, the MLP performs better than the SVM, so we choose to proceed with the MLP. We also report the accuracy of a Deep Belief Network (DBN) with various numbers of hidden layers of 300 neurons: with 2, 3 and 4 hidden layers, the accuracies were 93.54%, 93.29% and 93.45% respectively. The number of neurons was also varied from 100 to 1000 in steps of 50, with no significant change in accuracy. Since the data set is not large, we decided to use fewer neurons and just 1 hidden layer. The precision and recall for taan and non-taan were similar to the MLP results.
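As a rough stand-in for the classifier configuration above (one hidden layer of 300 sigmoid units, cross-entropy training), the setup could be expressed with scikit-learn as below; the solver and remaining hyperparameters here are assumptions, not the thesis's exact back-propagation setup.

    from sklearn.neural_network import MLPClassifier

    # Feed-forward network: one hidden layer of 300 sigmoid units,
    # trained by minimizing cross-entropy (sklearn's log-loss).
    mlp = MLPClassifier(hidden_layer_sizes=(300,), activation="logistic",
                        max_iter=500, random_state=0)
    # mlp.fit(train_feats, train_labels)          # labels: taan = 1, non-taan = 0
    # posteriors = mlp.predict_proba(test_feats)  # per-frame class posteriors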

Upon classification, the recall and precision of taan frame detection with respect to the ground truth serve to measure the discriminative power of the features. In our case, however, we seek to label continuous regions of the audio rendered in taan style, much as a human annotator would. This requires grouping frames based on their homogeneity with respect to the taan characteristics. Novelty detection based on a self-distance matrix (SDM) is an effective way to find segment boundaries [6].

We use a recently proposed approach of computing the SDM from the posterior probabilities derived from the features, rather than from the features themselves [1]. Using posterior probabilities for the SDM yields enhanced homogeneity, owing to the reduced sensitivity to irrelevant local variations. The Euclidean distance between vectors of posteriors is used for calculating the SDM, where the posteriors are the class-conditional probabilities obtained from the MLP classifier for each test input frame.
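The SDM computation itself is then a one-liner over the per-frame posterior vectors; a sketch assuming `posteriors` is an (n_frames, 2) array from the classifier above.

    from scipy.spatial.distance import cdist

    def posterior_sdm(posteriors):
        """sdm[i, j] = Euclidean distance between the posterior vectors
        of frames i and j."""
        return cdist(posteriors, posteriors, metric="euclidean")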


Table 6.1: Comparison of frame-wise accuracies of SVM and MLP, with precision and recall values for each class

    SVM (accuracy 91.74%):
      taan:     precision 0.7018, recall 0.8928
      non-taan: precision 0.9768, recall 0.9224

    MLP (accuracy 93.58%):
      taan:     precision 0.7962, recall 0.8216
      non-taan: precision 0.9586, recall 0.9647

Points of high contrast in the SDM are detected by convolution along the diagonal with a checkerboard kernel [12] whose dimensions depend upon the desired time scale of segmentation. Considering the minimum taan episode duration, this is chosen to be 5 sec, in the interest of obtaining reliable boundaries with few false negatives. The resulting novelty function is searched for peaks, which represent segment boundaries, using the 'local peak local neighborhood' method [7]. Whether the region between two detected boundaries corresponds to a taan is determined by the majority of the frame-level classifications in that region. Finally, the highest level of grouping is obtained by examining the region of audio separating every two detected taan segments. A simple heuristic is set up to mimic the musicians' annotation, as discussed in section 4.3: taan episodes separated by non-taan vocal activity within 10 sec are merged into a single section, and merging is also applied if the separation corresponds to a purely instrumental region of duration within 50 sec.
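The boundary detection and merging steps can be sketched as below. The checkerboard correlation follows Foote-style novelty adapted to a distance matrix, and `find_peaks` stands in for the 'local peak local neighborhood' picker of [7]; both are assumptions about implementation detail, with the 10 sec and 50 sec merge limits taken from the text.

    import numpy as np
    from scipy.signal import find_peaks

    def checkerboard_novelty(sdm, half):
        """Correlate a 2*half x 2*half checkerboard kernel along the SDM
        diagonal; 'half' frames correspond to the 5-sec time scale. On a
        *distance* SDM, a boundary yields large off-diagonal blocks, so
        the kernel is +1 there and -1 on the within-segment blocks."""
        v = np.concatenate([-np.ones(half), np.ones(half)])
        kernel = -np.outer(v, v)
        novelty = np.zeros(len(sdm))
        for t in range(half, len(sdm) - half):
            patch = sdm[t - half:t + half, t - half:t + half]
            novelty[t] = np.sum(kernel * patch)
        return novelty

    def merge_taan_episodes(segments, gap_is_vocal, max_vocal=10.0, max_instr=50.0):
        """Merge consecutive detected taan segments (start, end) in
        seconds when the gap is non-taan vocal activity within 10 sec or
        a purely instrumental region within 50 sec; gap_is_vocal[i]
        describes the gap after segments[i] (assumed to come from SVD)."""
        merged = [list(segments[0])]
        for (start, end), vocal in zip(segments[1:], gap_is_vocal):
            limit = max_vocal if vocal else max_instr
            if start - merged[-1][1] <= limit:
                merged[-1][1] = end        # absorb into the previous episode
            else:
                merged.append([start, end])
        return [tuple(seg) for seg in merged]

    # boundaries, _ = find_peaks(checkerboard_novelty(sdm, half=5), distance=5)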

The intermediate step of frame-wise classification works well using the MLP, as can be seen from the accuracy of the taan v/s non-taan classification. The evaluation of the grouping is described in the next chapter.


Chapter 7

    Experiments and Evaluation

Our ideal system would detect and segment taan sections similarly to a musician's labeling. This high-level task is attempted by the sequence of frame-level automatic classification and higher-level grouping described in section 6.4. We perform two types of experiments: the first uses the smaller, artist-specific data Subset B for deciding the best operating point for the conversion of MLP posteriors into labels; the second applies the parameters decided on the data Subset B to the data Subset D.

We present experimental results on the performance of each of the modules. Frame-level classification is measured by the detection of taan in terms of recall and precision. Artist-dependent and artist-independent training are compared within the 22-concert database. The frame-level classification needs frame-level (i.e. 1 sec resolution) annotation of taan presence or absence, required both for training the classifiers and for reliable testing. The musician labels are not directly usable for this purpose, owing to the presence of non-taan interruptions of significant duration within the musician-labeled taan sections, as seen before. Thus, for the development of the frame-level classifier, we need a finer marking of taan segments. Since this is a demanding task to carry out manually, we use the bootstrapped iterative SAD approach described in section 4.1.1(ii). The SAD evaluation showed that the frame-level labels so obtained were indeed accurate, and these were then used to train and evaluate the frame-level classifiers. The system is also evaluated after grouping, this time in terms of the match between the detected segments and the subjectively labeled taan segments for each concert. We tabulate the taan episode detection by reporting the numbers of over-segmentations, under-segmentations, exact detections, false negatives and false positives.

Conventional measures of performance include cluster purity, pairwise Hamming distance, boundary precision and recall, etc., as detailed in [24]. Any two carefully selected measures give an idea of the type of segmentation achieved; since our task is one of detection, we must additionally report the number of detections. Cluster purity and boundary retrieval were used in [1] to report segmentation evaluation. We also report the cluster purity and boundary retrieval values, as per these standard evaluation measures, in Table 7.1 for completeness, and to emphasize that they are not enough to give a full picture of taan section detection. The high cluster purity values (values close to 1 are good) show that the detected taan and non-taan sections are homogeneous and that there is over- and under-segmentation, but they give no idea of the number of detected taan sections. The same holds for the reported boundary retrieval values: they too reflect the under- and over-segmentation, but the number of sections is not conveyed.

Figure 7.1: Various scenarios that occur after grouping, viz. (a) false positive, (b) over-segmentation, (c) exact detection, (d) false negative, (e) under-segmentation.

Table 7.1: Performance evaluation using conventional measures on (a) the proposed system and (b) the GMM-based system of [1], after grouping at frame level

                       Boundary Retrieval       Cluster Purity
    Method             Precision   Recall       acp      asp      k
    (a) MLP based      0.448       0.578        0.768    0.779    0.773
    (b) GMM based      0.2187      0.4062       0.763    0.621    0.6869

To the best of our knowledge, however, no previous attempt has been made at using supervised classification for labeling the segments, especially for the problem of taan section detection in Hindustani vocal concerts. It is not possible to compare our results with any method other than [1], which works on Dhrupad and instrumental concerts; there, the segments are not labeled, whereas we have modified the approach to perform labeling in the case of taan. Moreover, the conventional measures of performance cannot give a fair idea of how many of the sections marked in the ground truth have been correctly labeled and retrieved. Along with the frame-level accuracy, we therefore measure performance by including the number of correctly retrieved taan segments and the number of false positives. A section is said to be correctly retrieved if at least 50% of its duration overlaps with a detected segment. Also of interest is the extent of over- or under-segmentation of the correctly detected taan sections. Figure 7.1 illustrates the different possibilities of mismatch observed between the subjective labels and the automatically labeled sections. When a subjectively labeled section is correctly detected, the onset and offset boundaries are observed to be always within 5 sec of the corresponding ground-truth boundaries, indicating the reliability of the posterior-based detection of sections.
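The 50% overlap criterion can be stated compactly; a sketch with illustrative names, operating on (start, end) intervals in seconds:

    def correctly_retrieved(gt_section, detected, min_overlap=0.5):
        """True if some detected segment overlaps the ground-truth taan
        section for at least 50% of the ground truth's duration."""
        s, e = gt_section
        return any(min(e, de) - max(s, ds) >= min_overlap * (e - s)
                   for ds, de in detected)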

    7.1 Evaluation on data Subset B

Our audio data Subset B consists of 57 Khayal vocal concert recordings partitioned into two distinct sets: 22 single-artist (Pt. Jasraj) concerts and 35 multi-artist concerts (that do not contain Pt. Jasraj). In both cases a number of different ragas are covered at various tempi, and all artists are male. The 22-concert set is treated as the test set under two different training conditions: artist-specific training via leave-one-song-out cross-validation, and artist-independent training using the 35 multi-artist concerts.

Figure 7.2: Shows (a) the ROC obtained by thresholding the posterior values from the MLP, for the leave-one-song-out case on the 22 concerts and for the 35-train/22-test scenario, and (b) SDM + novelty + grouping for the 35-train/22-test scenario using MLP posteriors; the pink boxes indicate the musician-marked ground truth, the red stars the MLP labels, the black continuous contour the novelty score, the green filled boxes the detected taan regions before grouping, and the circled peaks connected by blue lines the final taan episodes after grouping.