
Detection of Taan Sections in Khayal Vocal Concerts

    Submitted in partial fulfillment of the requirements

    of the degree of

    Master of Technology

    by

    Amruta J. Vidwans

    (Roll no. 123076005)

    Supervisor:

    Prof. Preeti S. Rao

    Department of Electrical Engineering

    Indian Institute of Technology Bombay

    2015

Amruta J. Vidwans / Prof. Preeti S. Rao (Supervisor): "Detection of Taan Sections in Khayal Vocal Concerts", MTech Degree Dissertation, Department of Electrical Engineering, Indian Institute of Technology Bombay, July 2015.

    Abstract

Structural segmentation of concert audio recordings is very useful in music navigation and automatic summarization. It is particularly relevant for Indian classical music, where concerts can extend for hours and commercial audio recordings rarely provide timing details of the various sections, although the performance typically follows an established structure depending on the genre. The distinct concert sections have contrasting rhythmic, and sometimes melodic, structures. The proposed work is concerned with the automatic segmentation of specific musical sections from the audio of Khayal vocal music concerts. The taan section has a distinct melodic character across concerts, irrespective of the tempo. Our goal is to label the taan sections using acoustic features that capture this melodic style. The features are derived from low-level audio analysis, including pitch and energy tracking of the singing voice.

The proposed system performs binary classification of frames into taan and non-taan classes. The posterior probability vectors obtained in the course of the statistical classification are used in a grouping stage. Smoothing at the higher time scale appropriate to concert-section detection is achieved using change detection methods. The grouping stage uses heuristics derived from a study of musicians' annotations of taan episodes on a concert data subset. We evaluate the system in two stages: by its frame-level classification accuracy, and by reporting the numbers of detected (true, over-segmented and under-segmented), false positive and false negative taan episodes after the grouping stage. We compare the results of our proposed method with an unsupervised segmentation method, showing that the proposed method achieves superior results over a database of 96 concerts in terms of giving fewer false positives.

Index terms: audio segmentation, taan detection, multilayer perceptron (MLP), posterior probability, self-distance matrix (SDM), novelty score, Hindustani classical music, Khayal vocal concerts

Contents

Dissertation Approval
Declaration of Authorship
Abstract
List of Figures
List of Tables
1 Introduction
2 Literature Survey
3 Segmentation System Overview
  3.1 Feature Extraction Block
    3.1.1 Pre-processing
    3.1.2 Feature Extraction
  3.2 Change Detection and Labeling Block
4 Database Description
  4.1 Database Subsets for Evaluation
    4.1.1 Database Subset of 32 Concerts
    4.1.2 Database Subset of 96 Concerts
    4.1.3 Database Subset of 22 Truncated Jasraj Concerts
    4.1.4 Musicians' Annotation: 24 Concert Subset
  4.2 Khayal Concert Details
    4.2.1 Khayal Concert Structure
    4.2.2 Taan Section
  4.3 Annotation Methodology
5 Pre-processing
  5.1 Singing Voice Detection (SVD)
  5.2 Obtaining Finer Ground Truth within Taan Section (SAD Method)
6 Taan Section Detection
  6.1 Pre-processing / Melody Extraction
  6.2 Feature Extraction
    6.2.1 Pitch Based Features
    6.2.2 Energy Fluctuation Rate
  6.3 Inspection of Variability of Features with Tempo
  6.4 Classification and Grouping using Posteriors
7 Experiments and Evaluation
  7.1 Evaluation on data Subset B
  7.2 Experiments on data Subset A
  7.3 Experiments on data Subset D
8 Conclusion and Future Work
Acknowledgements

List of Figures

3.1 Simplified block diagram for detecting sections in a Khayal vocal concert
3.2 Block diagram for the proposed system of detecting segments in a Khayal vocal concert
3.3 SDM and novelty for a sitar concert
4.1 Possible sequence of sections in a Khayal vocal concert
4.2 Rhythmogram of tabla onsets in a Khayal vocal concert
4.3 Various sections in a Khayal vocal concert
5.1 Algorithm-marked Vocal (V) and Instrumental (I) boundaries
5.2 Example spectrogram of errors in SVD
5.3 Example spectrogram of non-taan movement occurring in a taan episode
6.1 Spectrogram of an akar taan
6.2 Pitch and energy contour of a taan segment
6.3 Spectrogram of a sargam taan segment
6.4 Spectrogram of a bol taan segment
6.5 The Energy Fluctuation Rate feature for an audio plotted (a) without a texture window applied (b) with a texture window of 5 sec and hop of 1 sec
6.6 Taan frame feature values for Bada Khayal and Chota Khayal
6.7 Histogram of Euclidean distance between feature values to show tempo invariance of taan features
7.1 Various scenarios that occur after grouping
7.2 (a) ROC by thresholding the posterior values obtained from the MLP (b) SDM+novelty+grouping stages in one of the Subset B audios
7.3 Comparison of pitch contours obtained from PolyPDA and the Melodia plug-in
7.4 Comparison of taan episodes detected after SDM+novelty+grouping using posteriors from (a) MLP and (b) GMM
7.5 Errors in taan episode detection after SDM+novelty+grouping using posteriors from the MLP

List of Tables

4.1 Distribution of clips per artist and the gharana of the artists
4.2 Summary of database subsets
4.3 Khayal concert sections with musical characteristics and acoustic correlates
6.1 Comparison of frame-wise accuracies of SVM and MLP with precision and recall values for each class
7.1 Performance evaluation using conventional measures on (a) the proposed system and (b) the GMM based system of [1] after grouping at frame level
7.2 Taan detection performance after grouping for the 35-train / 22-test concert scenario (92% of taans detected)
7.3 Taan detection performance after grouping for (a) pitch extracted from the Melodia plug-in (33% of taans detected) (b) pitch extracted from PolyPDA (82% of taans detected)
7.4 Taan detection performance after grouping using (a) MLP on data Subset D (80% of taans detected) (b) GMM on data Subset D (86% of taans detected)
7.5 Frame level classification accuracy, precision and recall using MLP on data Subset D

Chapter 1

    Introduction

Audio segmentation systems aim at representing the audio at a broader level via labeled musical sections. In the case of Hindustani classical music there is much variability with respect to the duration of the concert, the number of repetitions of the segments, the freedom of improvisation, the school of music and the artist's individuality in rendering the concert. The most popular performance in a Hindustani classical vocal concert is of the Khayal genre. The problem of segmentation in Indian classical music has been approached in the past using rhythmic features for tabla concerts and a few Khayal vocal concerts [2], but that work lacks an evaluation of system performance or a structural analysis, as it gives only a visual representation. There are no papers that we know of addressing the problem of Khayal concert segmentation apart from [2]. Attempts were made by [3] for the case of Carnatic music to classify an audio into a section, rather than to segment within a concert, where the presence or absence of percussion was an important cue. A detailed review of audio segmentation systems for different music styles is given in Chapter 2.

The segmentation of a Khayal concert holistically is possible by using other dimensions of the audio, i.e. pitch, energy and timbre, along with rhythm. We study the musical and acoustic characteristics of the various Khayal sections and the possible features that can be used for them. For certain Khayal sections, the taan section in particular, the nature of the pitch variation is the only cue that can be used to distinguish it from other sections. We explore the possibility of taan detection and labeling in Hindustani classical Khayal vocal concerts by proposing features that distinguish taan from other sections. The main challenge is the detection of taan sections at a time scale meaningful to musicians. There should be a mechanism to ignore taan-like movements that may occur in other sections of a Khayal performance. The taan section might occur multiple times in the performance, and we need to retrieve all of its occurrences without inducing false positives. These are a few of the challenges of taan detection and labeling that we wish to address through this work. A general system for detecting sections of Khayal vocal concerts is proposed. Since the taan section can be distinguished using pitch alone, we need an automatic vocal pitch extraction algorithm. Pitch extraction algorithms are not yet fully automatic, with around 80% accuracy possible. Since pitch extraction of the lead vocal artist is a crucial step in designing features for detecting sections in Khayal, one of our goals will also be to look into possibilities of improving the accuracy of the current state-of-the-art pitch tracker [4] by improving one aspect of it, namely the Singing Voice Detection (SVD). The main focus is the study of a detection and labeling system for the sections of a Khayal vocal concert, with proposed features for the taan section, and the evaluation of this system. Such a system will be helpful in an audio browser like Dunya [5], which is dedicated to easy access and navigation of Hindustani classical concerts. It will be useful for fast navigation through Khayal concerts and for music summarization.

The existing segmentation systems are discussed in Chapter 2. There is no audio segmentation system that works on Khayal concerts, but systems developed on different databases that might be applicable to our case are mentioned. We give an overview of the proposed system in Chapter 3. We discuss the database that we have used for our study and describe the distinctive characteristics of the sections in Khayal concerts in Chapter 4. The database was selected such that all types of taan are present and almost all schools of thought in Hindustani classical music are represented. Pre-processing of the audios, required to obtain a reliable pitch track corresponding to the vocal melody, is discussed in Chapter 5, followed by a description of the features for the taan section of Khayal in Chapter 6. The experiments done at the intermediate and final stages of the system, and a discussion of the results obtained, are presented in Chapter 7, followed by the conclusion in Chapter 8.

Chapter 2

    Literature Survey

A broad overview of methods available for structural segmentation is given in [6]. The primary aim of the methods described in that overview paper is to identify the various homogeneous segments within an audio, and in some papers to label the segments as well. The papers reviewed in the article mainly use features that exploit one or a group of aspects related to melody, harmony, rhythm and timbre. Only a few papers have used all these aspects of audio together for segmentation, and the author states that this might be a more effective way of approaching the segmentation problem. The three criteria used to approach the segmentation problem across the papers are seen to be repetition, homogeneity, and novelty. The homogeneity- and novelty-based approaches give similar information, with homogeneity-based approaches describing the contents within a segment and novelty-based approaches describing the boundary between contrasting sections. No approach was seen to be particularly superior to the others, and they have been tried only on Western popular music, where the structure is relatively standard. The author stresses investigating the use of musically motivated features and different distance measures for frame-level feature comparison.

The paper by Turnbull et al. [7] is notable in that it uses all the aspects of the audio: timbre, melody, rhythm and harmony. They use a supervised approach where the features, along with their first and second order differences as well as their smoothed versions, are passed to a boosted decision stump classifier. They train the classifier to output frames as belonging to a boundary or non-boundary class, which is an unconventional way to approach the problem. Providing the information of change in this way was seen to work better in a supervised framework. Also, the final features after feature selection were seen to correspond to all the aspects, i.e. rhythm, timbre, melody and harmony, thus stressing the importance of using all the dimensions of music for effective segmentation.
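As an illustration only (not the exact system of [7]), the following minimal Python sketch shows the feature augmentation and a boosted decision stump classifier for this boundary/non-boundary framing; X_train, y_train and X_test are hypothetical frame-feature matrices and labels.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def augment(X, smooth_len=5):
        """Append first- and second-order differences and a smoothed
        version of the frame features, as described for [7]."""
        d1 = np.gradient(X, axis=0)
        d2 = np.gradient(d1, axis=0)
        kernel = np.ones(smooth_len) / smooth_len
        sm = np.apply_along_axis(
            lambda c: np.convolve(c, kernel, mode='same'), 0, X)
        return np.hstack([X, d1, d2, sm])

    # sklearn's AdaBoostClassifier boosts depth-1 trees (decision stumps)
    # by default:
    # clf = AdaBoostClassifier(n_estimators=200).fit(augment(X_train), y_train)
    # boundary_prob = clf.predict_proba(augment(X_test))[:, 1]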

The paper [8] applies HMM based methods to the music segmentation problem. The author mentions that the method is likely to succeed on recordings made with modern production techniques, particularly where copy and paste has been used to produce multiple segments of the same type. The noteworthy aspects are the use of an adaptive hop and window size (for feature calculation) depending on the beat interval, and the use of a histogram of states obtained from the Hidden Markov Model (HMM) to determine the labels. The segmentation is performed on Western music audios with sub-band energy in logarithmically spaced bands as the feature. A beat tracking algorithm is used to find the beat locations, which are used as the hop size, with the frame size also changing dynamically according to the inter-beat interval. This can be useful in Khayal vocal concert segmentation, as the tempo keeps increasing gradually over the concert; we might therefore want to use similar functionality for feature calculation. An HMM with a fairly large number of states is trained on the features. During training, the output of each state is assumed to have a single Gaussian distribution. After obtaining the state probabilities, the Viterbi algorithm is used to maximize the probability of the observed data, and the best path output is used to segment the sections. Thus every audio waveform is transformed into a time series with a specific value at every beat position. The authors clearly depict the need for a correct choice of the number of HMM states in bringing out specific patterns in the audio. A single HMM state will not correspond to a particular segment in the audio structure; rather, a collection of states contributes to it. Thus at each of the beat positions the histogram of the HMM states is computed. Manual segmentation is used to derive reference template histogram distributions, and then for each beat location the nearest reference template is assigned as the label for that beat.
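As a rough sketch of this pipeline, under the assumption that a beat-synchronous feature matrix X is already available, and using the hmmlearn package rather than whatever implementation [8] used:

    import numpy as np
    from hmmlearn import hmm

    def state_histograms(X, n_states=40, win=16):
        """Fit an HMM whose states have single (diagonal) Gaussian
        outputs, Viterbi-decode the state sequence, and histogram the
        states over a sliding window of beats."""
        model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag')
        model.fit(X)
        states = model.predict(X)  # best (Viterbi) path
        hists = []
        for i in range(len(states) - win):
            h = np.bincount(states[i:i + win], minlength=n_states)
            hists.append(h / win)
        return np.array(hists)

    # Each histogram row would then be matched to the nearest reference
    # template derived from manual segmentation, giving a label per beat.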

Another paper that uses various dimensions of the audio, such as timbre, rhythm, harmony and energy via MFCC, rhythmogram, chroma and short-time energy representations, is [9]. This paper emphasizes the performance improvement obtained in segmentation when multi-track audio data is available. The main idea is that each of the features may get corrupted in polyphonic recordings but is generally unaffected in separate recordings. They also start with beat tracking algorithms and then assign a feature vector to each beat position. Multiple features are extracted for each of the tracks: 13 MFCCs, a 12-dimensional chroma vector, 200 rhythmogram-based features and 1 RMS energy feature. For each track we thus get 226 features, and for each sub-category of features an SDM is computed separately. Each of these SDMs is convolved with an appropriate kernel to obtain a novelty score. In order to get the best performance, each of the SDMs is given a weight from {1, 10, 100}; a final SDM is obtained by adding the individual SDMs, and its performance is computed for different combinations of weights. For a particular genre, the weights which give the best performance are chosen as the final model.
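A schematic of this weighted fusion is sketched below; the feature matrices and the scoring function segmentation_score are placeholders for the paper's actual components, and the per-feature SDMs are assumed to be comparably scaled.

    import itertools
    import numpy as np
    from scipy.spatial.distance import cdist

    def sdm(F):
        """Self-distance matrix over the frame vectors of one feature family."""
        return cdist(F, F, metric='euclidean')

    def fused_sdm(feature_sets, weights):
        return sum(w * sdm(F) for w, F in zip(weights, feature_sets))

    # feature_sets = [mfcc, chroma, rhythm_feats, rms]  # hypothetical inputs
    # best_weights = max(
    #     itertools.product([1, 10, 100], repeat=len(feature_sets)),
    #     key=lambda w: segmentation_score(fused_sdm(feature_sets, w)))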

In the case of Hindustani classical music, the segmentation problem was approached using rhythm as a feature by [2] for Khayal vocal concerts as well as tabla concerts. In Hindustani music the changes in rhythm can be perceived when the tabla performer improvises away from the basic rhythmic cycle or when the rhythmic cycle itself changes. Tabla onsets are detected by processing the band of prominence of tabla strokes with an onset detection function, thus giving a representation of rhythm. The visual representation of the feature using a self-distance matrix reveals the rhythmic structure of a tabla solo concert, which has tabla as the lead instrument, and gives the boundaries of the major segments of the Khayal vocal concert. Spectral features combined with an auditory-processing-motivated bi-phasic function achieved good time localization of onsets for polyphonic audio. The changes in the self-distance matrix are detected by correlating a checkerboard kernel, of width relevant to the desired temporal resolution, across its diagonal to get a novelty score depicting the changes. Here, however, an evaluation of the system was not presented, and it was mentioned in the paper that the subsections within Khayal vocal concerts cannot be segmented using rhythm alone.

Verma et al. [1] have worked with Hindustani instrumental concerts and Dhrupad vocal performances and have combined various features dealing with different musical aspects of the audio, such as chroma, tempo and energy based features. They have proposed converting features into their posterior probabilities to make them robust to local variations. A limitation of their work is that they use a dataset which has little to no variability in the number of sections, and skipping or repetition of sections is not seen in their concerts.

Ranjani et al. [3] carried out classification of Carnatic audios into their musical forms. They do not perform segmentation but label an audio which represents a segment. Hierarchical classification was used to assign the label, using the absence or presence of rhythm as a major cue. This can be used in the Khayal vocal concert segmentation task after obtaining the boundaries, to arrive at the final label names using the feature value distributions.

Since our task involves the detection and segmentation of a specific named section of the concert, we need to invoke both segmentation and supervised classification methods. Musically motivated features and methods are our chosen approach, given their potential for success with limited training data [10]. The challenges to taan detection are the polyphonic setting, where we want to focus on the vocal signal, and designing distinctive features that are artist and concert independent. Given that pitch modulations are the prime characteristic of taan, reliable pitch detection with sufficient time resolution is necessary. Finally, we need to convert the low-level analyses to an annotation that closely matches the musicians' labeling of taan episodes from a performer's point of view. Towards these goals, we use a vocal source separation algorithm based on predominant-F0 detection [4]. Features designed to capture the characteristic rapid but regular pitch and energy variations of the voice are presented. A frame-level classification at 1 sec granularity is followed by a grouping stage, with the goal of emulating the subjective labeling of taan by musicians as extended regions that occur at salient positions in the concert.

Chapter 3

    Segmentation System Overview

Figure 3.1: Simplified block diagram for detecting sections in a Khayal vocal concert

We want to segment out the different sections in a Khayal vocal concert using their acoustic characteristics as discussed before. The final aim is to get correct section boundaries along with their labels. For that we propose the modules seen in the block diagram of Figure 3.1, consisting of two main blocks, viz. the feature extraction block and the change detection and labeling block. The feature extraction block deals with the observation of acoustic characteristics and with using existing features, or coming up with new ones, that will bring out the distinction between the various Khayal vocal concert segments. The pre-processing block processes the audio and presents it in a simplified form for feature extraction. For example, if features are to be extracted on the vocal pitch alone, then pre-processing will perform vocal pitch extraction; if rhythm related features are needed, then tabla strokes enhanced in a sub-band will be provided. The change detection and labeling block deals with the means to arrive at boundaries between sections, using the existing framework of the self-distance matrix (SDM) and the subsequent modules that seem suitable for the problem. The challenging aspect is to come up with section names and to detect their recurrence. The detailed block diagram is given in Figure 3.2.

Figure 3.2: Block diagram for the proposed system of detecting segments in a Khayal vocal concert

    3.1 Feature Extraction Block

This block includes a pre-processing step followed by the feature extraction step. Each is explained in the following sections.

    3.1.1 Pre-processing

Since the vocal pitch is the primary cue for various sections, we need to extract it reliably in order to compute features on it. Tabla onsets might also be a cue for broad-level sectioning into the alap, Bada Khayal and Chota Khayal sections, so we need voice and instrument sound isolation to extract features separately from each. The pitch extraction of the dominant source is first carried out using the technique of [4]. The assumption is that the vocal melody, being in the lead, will have the strongest contribution to the spectrum. Thus by extracting the predominant F0, we expect to get the melody corresponding to the voice wherever both the voice and the Secondary Melodic Instrument (SMI) are present. This is further used in the dominant source spectrum isolation block to isolate the dominant source spectrum, by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained at each analysis frame by searching in the vicinity of multiples of the detected pitch location.

Features are extracted on the source-isolated spectral envelopes, as they are less dependent on the pitch of the source and represent the source timbre better. The feature extraction block uses static and dynamic timbral features as well as dynamic F0-harmonic features, as in [11]. Using only selected features for the genre under consideration, i.e. Hindustani classical, and applying an unsupervised classification algorithm such as k-means clustering, labels are obtained for the frames as belonging to either the voice or the instrumental regions. An unsupervised algorithm is used because the nature of the acoustic characteristics is clearly different for voice and instrument regions within an audio. From here we can extract the pitch corresponding to the vocal regions alone and give it to the feature extraction block.

The accuracy of Singing Voice Detection (SVD) using the unsupervised k-means clustering algorithm and genre specific features was seen to be 83% in a previous study. Further experimentation and evaluation on the database considered in this thesis is explained in section 5.1. The pitch extraction algorithm of [4] is superior to others for the case of Indian classical music, with minor occasional octave errors. Any other errors observed during the course of the work also need to be taken care of, to improve the pitch extraction so that reliable feature extraction is possible for the detection of sections.
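A minimal sketch of the clustering step, assuming the frame-level features of [11] are already computed; the variance-based rule for identifying the vocal cluster is a hypothetical stand-in for the actual cluster-identification logic:

    import numpy as np
    from sklearn.cluster import KMeans

    def svd_kmeans(frame_features):
        """Cluster frames into two groups and return a boolean vocal mask."""
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(frame_features)
        # Hypothetical heuristic: the vocal cluster tends to show higher
        # feature variance, the voice being the more dynamic source.
        var0 = frame_features[labels == 0].var()
        var1 = frame_features[labels == 1].var()
        vocal_label = 0 if var0 > var1 else 1
        return labels == vocal_label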

    3.1.2 Feature Extraction

We can derive features for the purpose of detecting sections in a Khayal vocal concert using the extracted vocal pitch, or the source-isolated spectrum belonging to the voice alone, whichever is suitable given the acoustic characteristics of the Khayal sections in question. Other features based on the timbre or the rhythmic onsets of the tabla can be derived directly from the audio, using pre-processing as required. The acoustic characteristics and possible features for each section are discussed in section 4.2. Since we choose the features to be representative of a musical section, we expect them to be homogeneous within sections while contrasting across sections. This contrast among feature values across sections can be captured by giving the features to the change detection and labeling block.

    3.2 Change Detection and Labeling Block

This block is generic and can be applied to any set of features irrespective of the style of music, as the purpose of the block is to detect change between contrasting sections; for the purpose of labeling, however, some changes might be required. The first component in the change detection and labeling block is classification using the features. The posterior probability values obtained from the classification can be used for the computation of the Self-Distance Matrix (SDM) and the novelty score. The SDM can be computed on the features directly or by transforming them to their posterior probability values. This transformation was seen to be effective in the segmentation of Dhrupad vocal and instrumental alap audios [1], as seen in Figure 3.3, and hence can be investigated for use in Khayal vocal concerts as well. In the case of [1] the sections to be identified are fixed in number and occur sequentially, hence the labeling of the sections was not a major problem. They used unsupervised GMM classification to convert the features into posterior probability values, with the number of classes in the classifier equal to the number of sections. In the case of detecting sections in a Khayal vocal concert, though the number of section types is fixed, they do not occur sequentially and generally repeat over the concert in a way that cannot be generalized. Also, we finally need to label the sections, which will not be straightforward if we use unsupervised classification methods. Hence the preferable way in this case is to convert features into posteriors using a supervised classification algorithm and use the labels obtained for further post-processing.

An SDM is computed on the features / posteriors using D(i, j) = d(x_i, x_j) for i, j ∈ {1, 2, ..., N}, where the distance function d specifies the distance between two frames x_i and x_j. Typically used distance measures are the Euclidean distance or the element-wise dot product [12]. The SDM provides a visualization of how distinct the features are in different sections and of which sections repeat. As seen in Figure 3.3(b), 3 blocks corresponding to alap-jod-jhala are clearly seen in the SDM.

Points of high contrast in the SDM can be detected by convolution along the diagonal with a checkerboard kernel [12]. The dimensions of the kernel have to be decided depending on the time scale of the section to be detected. This convolution yields a one-dimensional plot called the novelty score, whose peaks indicate the contrasting points which, given suitable feature selection, may lie at the locations of section boundaries. The peaks in the novelty score might be closely spaced and spurious. Peak detection is therefore done in the peak detection block using the 'local peak local neighbourhood' search proposed in [7]. Even after this processing, the novelty score can still have multiple peaks, resulting in over-segmentation. At this point we know the labels, and hence we can name the sections between two peaks. Multiple sections can be further merged in the post-processing block using rules derived from musicians' annotations, for example by observing the maximum duration between two sections of the same type that the musicians ignored when merging sections.
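The sketch below illustrates this chain: an SDM computed from frame vectors (acoustic features or classifier posteriors), a novelty score obtained by sliding a checkerboard kernel along the diagonal, and a simple local-maximum peak search. The kernel size and neighbourhood width are illustrative values, not the settings used in this thesis.

    import numpy as np
    from scipy.spatial.distance import cdist

    def checkerboard_kernel(L):
        """2L x 2L kernel: -1 over the within-section quadrants, +1 over
        the cross-section quadrants, so that boundaries in a distance
        matrix produce positive novelty peaks."""
        q = np.ones((L, L))
        return np.block([[-q, q], [q, -q]])

    def novelty_score(X, L=30):
        D = cdist(X, X, metric='euclidean')  # self-distance matrix
        K = checkerboard_kernel(L)
        Dpad = np.pad(D, L, mode='edge')
        nov = np.zeros(len(X))
        for i in range(len(X)):  # correlate the kernel along the diagonal
            nov[i] = np.sum(K * Dpad[i:i + 2 * L, i:i + 2 * L])
        return nov

    def pick_peaks(nov, half_win=20):
        """Keep samples that are maxima of their local neighbourhood, in
        the spirit of the 'local peak local neighbourhood' search of [7]."""
        peaks = []
        for i in range(len(nov)):
            lo, hi = max(0, i - half_win), min(len(nov), i + half_win + 1)
            if nov[i] > 0 and nov[i] == nov[lo:hi].max():
                peaks.append(i)
        return peaks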

Figure 3.3: SDM (top) and novelty functions (bottom) for a sitar concert computed by (a) acoustic feature vectors (b) posterior feature vectors, as depicted in [1]

The final peaks thus obtained represent the boundaries of the sections, and the regions between them will have labels in the case of supervised classification. In the case of the unsupervised classification proposed in [1], on the other hand, we would still have to determine the labels for the sections as well as detect repetitions of the sections. Since the features are musically derived, we can make use of the feature values along with the boundary points to come up with a rule based decision or a hierarchical classification as done by [3]. Another approach could be to use a Hidden Markov Model (HMM) to detect repetition, but at the level of sections rather than at the frame level [13]. While image processing based methods to detect repetitions have been reported in [6], the method in [3] seems attractive due to its simplicity.

Chapter 4

    Database Description

We have 102 concert audios of 24 artists from commercially available CDs, spanning a number of gharanas of vocal music. The audios were stored at a 16 kHz sampling rate with 16 bit resolution in single channel format. The total duration of the audios is 51 hours, with 66 ragas covered in the recordings. The longest concert audio is 1 hr 3 min and the shortest is 15 min, while the average audio duration is 30 min. The durations of the concert sections in the audios thus vary to a large extent. The number of recordings per artist and their gharanas can be seen in Table 4.1. As can be seen, many artists belong to more than one gharana. The database was chosen such that almost all the gharanas were covered. It was observed that all the artists in the database render akar taans. To keep the overall percentage of bol taan and sargam taan on par with the akar taan, more concerts of the artists Jasraj, Ajoy Chakraborty and Kishori Amonkar were included, as they were seen to take sargam and bol taans. In spite of this, the percentage of sargam taan over all the concerts is just 1.6%, while the percentage of akar taan is 13.3% and that of bol taan is 3%.

The audios were selected considering that they have all the major sections, i.e. the unmetered alap, Bada Khayal, Chota Khayal and their improvisation sections. These divisions within the Khayal performance are explained in detail in section 4.2. As discussed earlier, the artists may skip sections or choose the tempo of the performance depending on time constraints. Generally, the taan section marks the end of the Bada Khayal and the Chota Khayal in any performance, but an artist might skip the taan section after the Bada Khayal and render it directly at the end of the Chota Khayal. This was observed particularly in the case of shorter audios of 15-20 min duration.

4.1 Database Subsets for Evaluation

Detailed experimentation has been performed using different specialized subsets of the database. The subsets created and their purposes are described below, with a summary in Table 4.2.

Table 4.1: Distribution of clips per artist and the gharana of the artists

Artist                  No. of Clips   Gharana
Aarti Ankalikar         2              Agra, Gwalior, Atrauli
Ashwini Bhide           1              Jaipur-Atrauli
Ajoy Chakraborti        12             Patiala
Aslam Khan              4              Agra
Dattatreya Velankar     2              Gwalior, Kirana
Girija Devi             4              Banaras
Gauri Pathare           2              Jaipur, Gwalior, Kirana
Hirabai Barodekar       4              Kirana
Jasraj                  24             Mewati
Jitendra Abhisheki      1              Agra, Jaipur
Jayateerth Mevundi      1              Kirana
Jagdish Prasad          4              Patiala
Kishori Amonkar         9              Jaipur, Bhendi Bazar
Kaushiki Chakraborti    4              Patiala
Kaivalyakumar Gurav     2              Kirana
Kumar Mardur            2              Kirana
Manik Bhide             2              Jaipur-Atrauli
Mani Prasid             2              Kirana
Malini Rajurkar         3              Gwalior
Prabha Atre             10             Kirana
Prabhakar Karekar       2              Agra, Gwalior
Raghunandan Panshikar   1              Jaipur
Ulhas Kashalkar         2              Gwalior, Jaipur, Agra
Veena Sahastrabuddhe    2              Gwalior, Jaipur, Kirana

4.1.1 Database Subset of 32 Concerts

This subset contains an almost equal mix of male and female artists, and different ragas, drawn from the 102 concert database. The total duration of the 32 concerts is 17 hrs. This subset is needed for the evaluation of 3 tasks that must be performed before proceeding to the detection of taan:

i) Singing Voice Detection (SVD): Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. To extract the melody, we need to first identify the regions where the vocal melody is present. This can be done using the SVD features proposed by [11], which are considered state-of-the-art for Hindustani classical music. We need to evaluate the performance of the SVD algorithm and look into possible improvements. This subset has hence been annotated for Vocal and Instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, out of which the duration of the vocal regions is 12 hrs. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.

ii) Obtaining finer ground truth within the taan section (SAD evaluation): If we observe a taan episode, some non-taan movements also occur between the taan movements, as seen in Figure 5.3. For the initial evaluation, we do not want the classifier accuracy to be affected by these non-taan regions occurring within the taan section. In order to facilitate this, the finer frame-level markings within the taan section were obtained automatically using the Speech Activity Detector (SAD) [14]. The evaluation of the SAD was done on this subset.

iii) Inspection of tempo invariance of features: As described in the previous chapter, the tempo keeps increasing gradually over the duration of the concert. We want to inspect whether the proposed features are invariant to this tempo variation. Generally an abrupt change in tempo is seen at the start of the Chota Khayal section, so the timing of this starting point can be used for this purpose. This is addressed in detail in section 6.3.

4.1.2 Database Subset of 96 Concerts

Among the 102 concerts, 32 were evaluated for SVD accuracy. Nonetheless, the taan sections in all the concerts were checked to verify that the pitch was not missing for major parts of the taan sections. For 6 of the 102 concerts it was observed that the pitch contour was absent even though the vocal melody was present in the audio. A detailed analysis of this is presented in section 5.1.

4.1.3 Database Subset of 22 Truncated Jasraj Concerts

From the 96 concert database, 22 concerts belonging to the artist Jasraj were used initially to observe the performance of the supervised classification algorithm for the detection of the taan section. The algorithm was subsequently tested on the larger database of 96 concerts. This subset was created to ease debugging of the various steps involved in the detection of sections and to obtain the optimal operating point to be used further on the larger database.

4.1.4 Musicians' Annotation: 24 Concert Subset

The end goal of this work is to mark taan episodes that are meaningful to musicians. It is important to obtain the taan section markings from different musicians and compare the differences to arrive at the logic behind their annotation. The differences were mainly in the taan episode markings, in terms of the allowable vocal non-taan regions between consecutive taan sections or the instrumental improvisation between the taan sections. This can be quantified and used to obtain a simple heuristic, to achieve the highest level of grouping of taan sections, by examining the region of audio separating every two detected taan segments in the musicians' annotations. For this purpose one concert of each artist was chosen to put together this subset of 24 concerts. Taan boundary markings were obtained from 3 musicians, and observations were carried out on them. In general, among all the markings, it was observed that the mukhada occurring at the end of a rhythmic cycle was combined into the taan episode. At most 10 sec of vocal duration corresponding to non-taan was combined into a taan episode. Also, instrumental improvisation occurring between rhythmic cycles was considered part of the taan if its duration was less than 50 sec. These insights were used to combine consecutive taan segments.
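A minimal sketch of such a grouping rule, using the 10 sec vocal and 50 sec instrumental gap limits observed above; segments are assumed to be sorted (start, end) times in seconds, and is_vocal_gap is a hypothetical helper that decides whether the gap between two segments is predominantly vocal:

    def group_taans(segments, is_vocal_gap,
                    vocal_gap_max=10.0, instr_gap_max=50.0):
        """Merge consecutive detected taan segments separated by short gaps."""
        merged = [list(segments[0])]
        for start, end in segments[1:]:
            gap = start - merged[-1][1]
            limit = (vocal_gap_max if is_vocal_gap(merged[-1][1], start)
                     else instr_gap_max)
            if gap <= limit:
                merged[-1][1] = end  # absorb the gap into the episode
            else:
                merged.append([start, end])
        return [tuple(seg) for seg in merged]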

Table 4.2: Summary of database subsets

Subset A (32 concerts): SVD and SAD evaluation; proving tempo invariance of features; comparison of taan detection performance using different pitch detection algorithms.
Subset B (22 Jasraj concerts for testing + 35 other male artists' concerts for training): truncated concerts used for finding the f-measure; initial testing of the system.
Subset C (24 concerts, 1 concert per artist): musicians' annotation for obtaining grouping heuristics.
Subset D (96 concerts): evaluation over the entire dataset; comparison of the proposed method with [1].

4.2 Khayal Concert Details

The Khayal concert is the most popular form in Hindustani classical vocal music and is based on the theme of a raga. A raga is not just a scale of allowed notes but also has motifs associated with it. The vocal artist uses the raga as the theme of the performance and is accompanied by tabla for rhythm, harmonium or sarangi for melodic accompaniment, and tanpura for keeping a reference to the tonic. The role of the melodic accompaniment is to follow the main melody, while the role of the tabla is timekeeping. The vocal artist is the main performer of the concert and has the liberty to give a few minutes of the concert to the tabla and harmonium artists for showing their skills at improvisation.

4.2.1 Khayal Concert Structure

The aim of Khayal is to elaborate on the idea of the raga, via motifs and note-by-note elaboration, as perceived by the performing artist. In a typical Khayal vocal concert the artist chooses a raga and starts introducing it through motifs in the alap. This introductory alap is not accompanied by tabla, i.e. the percussive accompaniment of the concert. The alap lasts at least half a minute and may be rendered for more than 10 min, depending on the gharana, i.e. the school of the artist. This is followed by the Bada Khayal composition and its improvisation section, where the tabla sets in with a particular tala in a slow (vilambit laya) or medium (madhya laya) tempo, as suitable for the bandish, i.e. the composition selected by the artist. The bandish generally comprises four lines of poetry. At the start of the Bada Khayal the artist renders the first 2 lines, called the sthayi, of the composition once or twice. The sthayi is limited to the middle and lower register [15].

After this the artist starts the improvisation of the Bada Khayal by first taking the alap, which can be rendered using the lyrics of the composition or the vowel /a/. The next two lines of the composition, called the antara, have their melody in the second part of the middle octave and higher [15]. Hence, depending on the raga elaboration followed by the artist, it can be taken when the artist is near those notes. After the alap the artist plays with the rhythm and melody in the baat section, which can again be rendered using the lyrics of the composition (bol), note names (sargam) or the vowel /a/ (akar). This section may be present or absent depending on the artist. After baat follows the taan section, which can be rendered using the lyrics, note names or the vowel /a/. The rendering of the taan section marks the end of the Bada Khayal improvisation. An artist may sometimes prefer not to take a taan section in the Bada Khayal improvisation, but that is rare. The concert then takes on a faster tempo relative to the Bada Khayal by moving on to the Chota Khayal composition and its improvisation sections, which are the same as for the Bada Khayal, i.e. the alap, baat and taan. Often, since the rhythm has entered a fast tempo, the artists prefer to skip the alap and baat sections after rendering the Chota Khayal composition and take the taan directly. Many a time the artists prefer to take a tarana instead of a Chota Khayal, but the improvisation sections remain the same as those of the Chota Khayal. The tarana has no apparently meaningful lyrics and uses words like 'dirdir', 'tanana', 'dim', 'tom', etc., as well as the bols of the tabla. The individual acoustic and musical characteristics of the various Khayal vocal concert sections are given in Table 4.3 and their hierarchy is depicted in Figure 4.3. The sequence of different sections over an entire concert of 27 min can be seen in Figure 4.1.

Figure 4.1: Approximate sequence of different sections in a Khayal vocal performance of raga Shree by artist Kumar Mardur

Figure 4.2: Rhythmogram of tabla onsets in a raga Deshkar Khayal vocal performance by artist Kishori Amonkar, as depicted in [2]

Figure 4.2 shows a rhythmogram, which is a two dimensional time-pulse representation with lag-time on the y-axis, time position on the x-axis and the auto-correlation values of onsets visualized as intensity. The auto-correlation peaks give us an idea of the tempo in the concert: as the peaks get closer, it can be interpreted that the tempo is increasing. Throughout the concert, the tempo is seen to increase gradually within the Bada Khayal, with an abrupt increase in tempo (possibly also a change in tala and raga) for the Chota Khayal, as seen in Figure 4.2. The tempo of the Bada Khayal is not fixed to a particular value across concerts; many artists take tempi from as slow as 10 bpm to as fast as 40 bpm, which might seem like madhya laya [16], while the drut laya ranges from 160-320 bpm. The improvisation sections may not follow any particular order, but the above-mentioned sequence is a general trend. The melody improvisation is gradual and note by note, with the 'mukhada' marking the end of a rhythmic cycle. The artist generally starts improvising in the lower octave, then the middle octave, and then advances to the upper octave. Melodic movements do not span multiple octaves within a rhythmic cycle, with the exception of the taan section, where the artist tries to show off his mastery over the voice. According to [17] the various sections do not fall into rigid divisions, and there might be occasional overlap between them. Since the sections differ in proportion, placement and quality of impact according to the musical context, one section cannot be mistaken for another.
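For illustration, a rhythmogram of the kind shown in Figure 4.2 can be computed as the short-time autocorrelation of an onset-strength envelope. The librosa onset detector below is a generic stand-in for the band-limited tabla onset detection of [2], and the window and lag settings are illustrative.

    import numpy as np
    import librosa

    def rhythmogram(path, win_sec=8.0, hop_sec=1.0, max_lag_sec=2.0):
        y, sr = librosa.load(path, sr=16000)
        env = librosa.onset.onset_strength(y=y, sr=sr)  # onset envelope
        fps = sr / 512.0  # envelope frame rate (default hop of 512 samples)
        win, hop, lags = (int(x * fps) for x in
                          (win_sec, hop_sec, max_lag_sec))
        cols = []
        for start in range(0, len(env) - win, hop):
            w = env[start:start + win] - env[start:start + win].mean()
            ac = np.correlate(w, w, mode='full')[win - 1:win - 1 + lags]
            cols.append(ac / (ac[0] + 1e-9))  # normalize by the lag-0 value
        return np.array(cols).T  # lag x time; plot as an intensity image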

    4.2.2 Taan Section

In this work, our focus is on segmenting taan sections that are melodically salient, i.e. where the sequence of melodic phrases or notes is rendered in a characteristic melodic style. The notes may be articulated in various ways, including solfege (sargam taan) and the syllables of the lyrics (bol taan). Most common, however, is the akar taan, rendered using only the vowel /a/ (i.e., as melisma). The sequence of notes is relatively fast-paced and regular, produced as skillfully controlled pitch and energy modulations of the singer's voice, similar to vibrato. But unlike vibrato, which ornaments a single pitch position in Western music, the cascading notes of the taan sketch elaborate melodic contours like ascents and descents over several semitones. The melodic structure is strictly within the raga grammar, while the step-like regularity in timing brings a rhythmic element to the improvisation, in contrast to the (also improvised) alap sections. Apart from showcasing the singer's musical skills, one or more taan sections typically contribute to the climax of a raga performance and therefore serve as prominent musicological markers. These unique characteristics of the taan section motivate us to investigate its detection further.

Figure 4.3: Various sections in a Khayal concert are depicted, with the sequence being the alap without percussion followed by the Bada Khayal and its improvisation component

Table 4.3 provides the musical and acoustic characteristics of the taan section. All concerts may not have all the taan types. It was seen that the akar taan is rendered by artists across all the schools of music, while the least rendered type was the sargam taan. The minimum percentage of taan within a concert, relative to the total concert duration, was found to be 3.9%, while the maximum was 38%; on average, 18% of a concert is taan. The minimum duration of a taan section is 5.6 sec across all the concerts. The database subsets used for the task of taan detection and for evaluating the pre-processing steps were described in section 4.1; the annotation methodology is described in the next section.
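As an illustration of one acoustic correlate listed in Table 4.3, the sketch below measures the oscillation rate of a mean-subtracted contour (vocal pitch or energy) as zero crossings per second; the contour and its frame rate fps are assumed inputs from the pre-processing stage.

    import numpy as np

    def oscillation_rate(contour, fps, win_sec=1.0):
        """Approximate oscillation rate (Hz) from zero crossings of the
        locally mean-subtracted contour; taan regions are expected to
        show high, regular rates."""
        win = int(win_sec * fps)
        rates = []
        for start in range(0, len(contour) - win + 1, win):
            w = contour[start:start + win]
            w = w - w.mean()  # remove the local trend
            zc = np.sum(np.abs(np.diff(np.sign(w))) > 0)
            rates.append(zc / (2.0 * win_sec))  # two crossings per cycle
        return np.array(rates)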

    4.3 Annotation Methodology

Keeping in mind the musical characteristics mentioned in Table 4.3, as well as the acoustic characteristics, we can mark the section in Praat [18], giving us the time instances of the start and end of the section. Before starting the annotation, the need for annotating the taan section was explained to the musicians, i.e. easy navigation within a concert to go to a particular section. The musicians were asked to annotate the taans as though they were pointing out to their students the locations of taan episodes for their study of taan. Taan-like movements might occur in other sections for short intervals as well, but do not form a taan section. Also, the taan section can occur multiple times, and the artist has the liberty to render it anywhere in the concert. Multiple passes near a boundary were thus allowed before finalizing it as a valid boundary. The rhythmic cycle is important here, as the tempo is slow in the initial part of the concert and fast in the later part. If the majority of a rhythmic cycle contains taan-like melodic movements which are followed by more such movements over subsequent cycles, then it was marked as a taan section. A taan section thus comprises a cluster of taan-like movements occurring over multiple cycles. The section end is marked when a different homogeneous section other than the taan episode starts.

Table 4.3: Khayal concert sections with musical characteristics and acoustic correlates

Unmetered Alap
Musical characteristics: Tabla: (a) absent; an unmetered, non-pulsated section. Vocal melody: (b) introduction of the raga, a slow rendition of raga motifs with more steady notes. (c) Depending on the nature of the raga, the artist generally starts from the middle octave Sa, then moves to the lower octave, back to the middle octave, on to the higher octave, and ends again on the middle octave Sa. (d) Rendered only with the vowel /a/.
Acoustic characteristics: (a) The wideband events of tabla strokes are absent, as no percussion is present; only the voice of the lead artist and the accompaniment are heard. (b) Long steady pitches at frequencies corresponding to the note locations of the raga. (c) The evolution of pitch values over the alap can be observed. (d) Harmonics at the formant locations corresponding to the vowel /a/ appear dark.
Features: (a) Absence of tempo in the rhythmogram of tabla onsets. (b) A steady-note measure on the pitch values if the rhythmogram is not sufficient. (c) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time. (d) Spectral centroid.

Bada Khayal
Musical characteristics: Tabla: (a) percussion sets in. Vocal melody: (b) distinctive features of the raga are displayed through a composition having two parts, viz. sthayi and antara. (c) Usually the first line of the sthayi, called the 'mukhada', serves as a recurring theme in the performance and gives a cue for the 1st beat of the rhythm cycle. (d) Sthayi: melody in the 1st part of the middle octave and part of the lower octave. Antara: melody from the 2nd part of the middle octave to the upper octave and beyond. Tempo: slow or medium.
Acoustic characteristics: (a) Wideband events of tabla strokes at regular intervals seen in the spectrogram. (b) Generally a change of pitch at the beats; long held pitches due to the slow tempo. Lyrics comprising vowels and consonants break a note being rendered continuously; formant locations change with the vowels. (c) The 'mukhada' melodic phrase is generally less variable in the pitches used and can be identified to get the start of a cycle. (d) Evolution of pitch values in the alap section.
Features: (a) The total section of the Bada Khayal and its improvisation can be separated using a rhythmogram of tabla onsets. (b) The bandish section alone can be separated using the cue of lyrics, via the spectral centroid of the source-isolated spectrum, as it comes immediately before the alap with tabla and immediately after the alap without tabla. (c) 'Mukhada' identification [19] can be used. (d) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time.

Chota Khayal
Musical characteristics: Tabla: (a) the tempo is fast (in comparison with the Bada Khayal). Vocal melody: (b) similar to the Bada Khayal, a composition with sthayi and antara; a tarana may also be taken, which uses syllables like ta, na, de, re, dim, or even tabla bols, instead of lyrics.
Acoustic characteristics: (a) The interval between tabla strokes is smaller than in the Bada Khayal. (b) Pitch is not held for long durations (relative to the Bada Khayal); increased pitch ornamentation compared to the Bada Khayal.
Features: (a) A rhythmogram of tabla onsets can separate the Chota from the Bada Khayal. (b) Spectral centroid and its delta, as consonants occur in quicker succession than in the Bada Khayal, where vowels are stretched over long notes.

Alap / Vistar (Akar / Bol)
Musical characteristics: Tabla: (a) slow tempo; more percussive fillers. Vocal melody: (b) slow elaboration of the raga through a sequence of melodic phrases with emphasis on resting notes. (c) The melody mainly remains in the 1st part of the middle octave. (d) Elaboration done using either the vowel /a/ (akar) or the lyrics of the composition (bol); emphasis on melodic patterns and variation of motifs.
Acoustic characteristics: (a) Wideband events of percussive strokes visible at more widely separated instances than in the Chota Khayal; fillers present. (b) Constant pitch held for a long time. (c) Pitch remains in the middle octave. (d) Formants change as the lyrics are uttered.
Features: (a) A rhythmogram of tabla onsets will not distinguish this section from the others in the Bada Khayal improvisation. (b) A steady-note measure to distinguish it from the baat and taan sections. (c) A short-time histogram can be used to see the note being emphasized in each time interval and the trend over time. (d) Spectral centroid to distinguish bol from akar.

Baat / Layakari (Sargam / Bol)
Musical characteristics: Tabla: (a) may mimic the patterns of the singer or play normally, depending on context. Vocal melody: (b) rhythmic improvisations using the names of the notes (sargam) or the lyrics of the composition (bol); stress is given at the beats by making note or lyric changes there. The speed of the bol/sargam can be the same as the tempo, or twice it, or any multiple, but not as fast as in taan; playing with notes and lyrics with emphasis on rhythm. (c) No long held notes.
Acoustic characteristics: (a) Consonant breaks at regular intervals in the spectrogram; note transitions at the percussive hits. (b) Pitch modulation, if any, moderate compared to taan; no long held pitch.
Features: (a) A rhythmogram of vocal onsets will be useful after source isolation; spectral centroid after source isolation. (b) Pitch modulation captured via the energy ratio between the frequency ranges corresponding to the oscillations here.

Taan (Akar / Bol / Sargam)
Musical characteristics: Tabla: (a) the tempo might get faster, with the basic tala being played without fillers. Vocal melody: (b) rapid gamak-like movements taken using the vowel /a/ (akar), the lyrics of the composition (bol) or the names of the notes (sargam). (c) The patterns range over all 3 octaves, and the section serves as a climax of the raga presentation; emphasis on vocal skills. This is the most distinctive section of the Khayal rendition.
Acoustic characteristics: (a) Wideband events of percussive strokes at regular but shorter intervals (with no fillers). (b) Rapid pitch oscillations and rapid energy fluctuations. For akar, the formants corresponding to the vowel /a/ look dark; in the case of bol the formants change gradually, while in sargam they change rapidly.
Features: (a) A rhythmogram cannot distinguish this section from the others. (b) Frequency of the oscillatory pitch; rate of zero crossings of the mean-subtracted energy contour; spectral centroid for formant-change detection in bol and sargam.

Chapter 5

    Pre-processing

Before extracting the features, we need to extract the pitch corresponding to the vocal melody. We approach this via slight modifications to the state-of-the-art singing voice detection (SVD) algorithm [11]. Another required step is to obtain a finer ground truth for calculating the frame-wise accuracy of taan detection, as described in section 4.1.1. Each of these tasks is described in detail below.

    5.1 Singing Voice Detection (SVD)

Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. An important step in extracting the melody is the detection of the regions where the singing voice is present, so that the melody is extracted only in those regions, as explained in section 3.1.1. The data Subset A has been annotated with Vocal and Instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, of which 12 hrs are vocal regions. Any pause the artist takes for breath is ignored and included in the vocal region; breath pauses are those whose duration is perceptually insignificant (roughly less than 80 ms), as noted in [23]. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.

The pitch of the dominant source is first extracted using the technique of [4]. The assumption is that the vocal melody, being in the lead, has the strongest contribution to the spectrum; thus, by extracting the predominant F0, we expect to obtain the melody corresponding to the voice wherever both voice and accompaniment are present. The F0 is further used to isolate the dominant source spectrum by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained by searching in the vicinity of multiples of the location of the detected pitch.

Features are extracted from the source-isolated spectral envelopes, as these are less dependent on the pitch of the source and represent the source timbre better. The static timbral, dynamic timbral and dynamic F0-harmonic feature categories were proposed in [11]. There, the authors experimented with just 13 mins of Hindustani classical audio and used a supervised GMM in leave-one-song-out validation with four mixtures per class, giving an accuracy of 84%. In this work, we combine the feature categories and apply feature selection using the WEKA toolbox [20]; in [11], feature selection was applied within each feature category and supervised classification was carried out. We obtain 11 selected features from among all the categories (85 features in total) using the Best First search method in the WEKA toolbox [20]. We explore the possibility of using an unsupervised classification algorithm, since the nature of the acoustic characteristics is clearly different for the voice and instrument regions within an audio. A particular audio will always contain tabla for rhythmic accompaniment and tanpura for the tonic reference, but the melodic accompaniment might be sarangi or harmonium. An artist might also choose to use a swaramandal so that different reference notes can be played. In rare cases, for the Chota Khayal section, the artist might choose to use both tabla and mridangam for rhythmic accompaniment. For these reasons, it is likely better to make a within-clip decision for the Vocal and Instrumental marking.

The unsupervised classification algorithm k-means clustering is applied, and labels are obtained for the frames as belonging to either the voice or the instrumental regions by using one of the dominant features to make the decision. From here we can extract the pitch corresponding to the vocal regions alone and pass it to the feature extraction block. Singing Voice Detection (SVD) using k-means clustering and genre-specific features was seen to give 88.30% accuracy when evaluated on the data Subset A. The SVD accuracy within the taan regions was also evaluated, to establish that SVD-related errors do not affect the taan episode detection; the frame-wise accuracy within the taan regions came out to be 89.20% for the data Subset A. The labeling of the clusters obtained after k-means is of major concern. The labeling is done using one of the features with the highest individual classification accuracy, namely the normalized harmonic energy.
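As an illustration, the following is a minimal sketch of this within-clip clustering step, assuming the 11 selected features have already been computed per frame and that the column holding the normalized harmonic energy is known; the function and argument names are illustrative, not the thesis code.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_svd(features, harmonic_energy_col=0):
        """Two-cluster k-means over the selected per-frame features; the
        cluster with the higher mean normalized harmonic energy is labeled
        Vocal ('V'), the other Instrumental ('I')."""
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
        means = [features[km.labels_ == c, harmonic_energy_col].mean()
                 for c in (0, 1)]
        vocal = int(np.argmax(means))
        return np.where(km.labels_ == vocal, "V", "I")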

Figure 5.1: Algorithm-marked Vocal (V) and Instrumental (I) boundaries for the audio concert with the least accuracy. The highlighted “I” marking should have been “V”.

The accuracy, obtained by comparing the frame-wise labels, came out to be 88.30%, which makes our SVD algorithm reliable for the subsequent extraction of pitch in the vocal regions alone. Analysis of the two clips where the accuracy was below 70% shows that their spectrograms were washed out, i.e. the vocal harmonics were not clear, and in a few instances the background accompaniment was loud. As can be seen in Figure 5.1, the highlighted region does not look clear even though the voice of the lead vocal artist is present and audible in that region of the audio. Tanpura suppression might help in cases where a loud tanpura affects the SVD decision, since long steady notes get marked as belonging to the Instrumental category.


Though the accuracy of SVD was seen to be high, we inspected all the taan regions in the total database of 102 concerts. Six clips in particular had problems in the SVD decision within the taan region, apart from the two error clips analyzed in the data Subset A. In the 6 concerts that were eliminated from the 102 concerts for taan detection to create the data Subset D, background vocals were present. There was also an accompanying sarangi, which has a frequency range similar to the human voice and, unlike the harmonium, is played in a continuous fashion. This creates confusion in the Vocal and Instrumental decisions made by the SVD algorithm. Figure 5.2 illustrates one such case. Though the sarangi is played at a pitch higher than the voice, the SVD cannot give more weight to the pitch feature, as all the features have equal weights.

Figure 5.2: Spectrogram where, along with the lead artist, there is an accompanying instrument, the sarangi, whose frequency range is similar to that of the human voice.

The frame-level accuracy of SVD was also calculated using supervised GMM classification in leave-one-song-out validation, as proposed in [11]. Four Gaussians were used to model each class (Vocal and Instrumental), and the accuracy was found to be 80.10%, which is less than that obtained using k-means clustering (88.30%).

5.2 Obtaining Finer Ground Truth within the Taan Section (SAD Method)

We observed that the musicians labeled taan based on the perceived intent of the performer, i.e. relatively short stretches of instrumentals and other vocal styles occurring between taan episodes were subsumed under the taan label (as in Figure 5.3). For the real-world use case, we would like our automatic system to match the musicians' labeling of the taan sections in the concert. At the same time, for the initial evaluation, we do not want the classifier accuracy to be affected by the non-taan regions falling inside the taan sections, as seen in Figure 5.3. To facilitate this, the finer frame-level markings within the taan sections were obtained automatically using a Speech Activity Detector (SAD). The SAD method is completely unsupervised and is based on the speaker diarization system of [14].

The SAD was provided with our carefully derived pitch- and energy-based features. It is an iterative classification process in which separate Gaussian mixture models (GMMs) are fitted to the frames classified as speech and non-speech (in our case, taan and non-taan). Classification is then performed on all the frames again using these models. The process repeats, stopping either at convergence or after a maximum number of iterations.
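The following is a minimal sketch of such an iterative refinement, assuming pre-computed features and an initial taan/non-taan guess; the mixture count, the iteration cap, and the prior-free likelihood comparison are assumptions rather than details of the system in [14].

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def iterative_gmm_sad(feats, init_labels, n_components=4, max_iter=10):
        """Iteratively refit one GMM per class (taan / non-taan) and
        reclassify every frame until the labels stop changing.

        feats: (n_frames, n_dims) pitch- and energy-based features.
        init_labels: boolean array, initial taan (True) / non-taan guess.
        """
        labels = init_labels.copy()
        for _ in range(max_iter):
            gmm_taan = GaussianMixture(n_components=n_components,
                                       random_state=0).fit(feats[labels])
            gmm_rest = GaussianMixture(n_components=n_components,
                                       random_state=0).fit(feats[~labels])
            # Reclassify all frames by comparing per-class log-likelihoods
            new = gmm_taan.score_samples(feats) > gmm_rest.score_samples(feats)
            if np.array_equal(new, labels):      # converged
                break
            labels = new
        return labels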

Chapter 6

    Taan Section Detection

The taan section is important to identify because it hints at the end of a Bada Khayal or Chota Khayal improvisation part. The typical characteristic of the taan section is rapid oscillatory pitch fluctuation, as seen in Figure 6.1. Further information about the types of taan, and the need for pitch-based features for its detection, is given in Table 4.3. The movement is similar to vibrato in Western music, but with a slightly different rate of frequency modulation, and with oscillations that are not limited to a single note. These oscillations occur irrespective of whether the taan is rendered in the Bada Khayal improvisation section or in the Chota Khayal. While the acoustic characteristics of the other sections may change owing to the gradual increase in tempo over the entire duration of the concert, the taan rate typically lies in the range of 5-10 Hz across artists, irrespective of the underlying tempo changes.

A spectrogram of a typical akar taan can be seen in Figure 6.1, where the x axis is time and the y axis is frequency in Hz. Across various concerts, the pitch oscillation frequency was observed to lie between 5 and 10 Hz. According to the studies in [21], in a pedagogical scenario the rate of taan oscillations ranges from 1.65 to 3.14 Hz, but we are interested in the actual performance scenario, where this rate is considerably higher. The vocal melody extracted from a concert shows the rapid pitch oscillations of the taan section, which starts after 1196 sec in Figure 6.2 (a), in contrast to the steady notes and slow ornamentation before 1196 sec. As can be seen in Figure 6.2 (a), around 6 oscillations occur in the one-second interval from 1198 sec to 1199 sec.

Figure 6.1: Spectrogram of a part (7 sec) of a longer akar taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the vowel /a/. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.


Figure 6.2: Performance of raga Shree by Kumar Mardur, where the taan section begins at 1200 sec, with (a) pitch variations and (b) energy variations corresponding to the voice alone.

Figure 6.3: Spectrogram of a part of a sargam taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the phones of note names like Pa, Ni, Sa, Ga, etc. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.

Another characteristic is that the energy in the taan section fluctuates more rapidly than in the other sections, as seen in Figure 6.2 (b).

While rendering an akar taan, the artist generally sticks to the vowel /a/, and the formant locations corresponding to it can be seen as dark lines in Figure 6.1. In the spectrogram of a sargam taan in Figure 6.3, considerable formant movement can be seen as the singer utters the swaras (solfege) while singing the oscillatory melodic movements. Many 'breaks' can thus be seen in the melody line, corresponding to the consonants being uttered. The bol taan section in Figure 6.4 also shows formant movement, but not as rapid as in the sargam taan, because the artist utters the consonants at larger time intervals than in the sargam case and holds the vowels of the lyrics for longer durations.


Figure 6.4: Spectrogram of a part of a bol taan section in a raga Madhukauns performance by artist Jagdish Prasad. The oscillatory pitch harmonics of the vocal melody are visible, with the darker harmonics corresponding to the phones of the lyrics 'Tuma bina kaun'. The beat tier indicates the tabla hits, which appear as vertical lines in the spectrogram.

    6.1 Pre-processing / Melody Extraction

Vocal pitch is the only cue for distinguishing the taan section; therefore, we need to extract it reliably before computing features on it. The SVD algorithm's accuracy is 88.30%, as seen in section 4.1.1(i), which makes the Vocal and Instrumental decisions very reliable for the case of Hindustani music. The pitch of the Vocal regions marked by the SVD algorithm is extracted using the technique of [4], henceforth referred to as PolyPDA in this thesis.

    6.2 Feature Extraction

We aim to capture the acoustic characteristics of the taan section, identified by observing spectrograms of a number of clips. As per our observations illustrated in Figure 6.2, both the pitch oscillations and the variations in the energy contour can be exploited to detect the taan sections among all the sections: regular pitch oscillations appear in the taan section, as opposed to the non-taan sections, and the fluctuations in the energy contour are high in the taan section. The pitch values are calculated at 10 ms intervals after pitch extraction from the polyphonic audio, corresponding to the voice alone. Features are first calculated over short analysis frames and then averaged over larger texture windows. This helps to eliminate taan-like movements that might occur spuriously in other sections and to obtain the average overall behaviour. For our study, we consider a texture window of 5 sec, which is slightly less than the minimum duration of a taan section, with a texture-frame hop of 1 sec. The analysis frames are 1 sec long with a 0.5 sec hop, which was also seen to be effective in [22]. As can be seen from Figure 6.5, averaging over the texture windows is essential for avoiding spurious feature values when a taan-like movement occurs in a non-taan section. The pitch contour is first interpolated across silences shorter than 80 ms to avoid breaks in the pitch due to consonants and breath pauses [23]. The feature values are normalized to zero mean and unit variance across the concert. Already from the unnormalized feature values in Figure 6.5 it can be seen that the distinction between the taan and non-taan regions is quite clear.
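For concreteness, the two-level windowing can be sketched as below, operating on an already-computed analysis-frame feature sequence (2 values per second at the 0.5 sec hop); the function name and the 1-D input are illustrative assumptions.

    import numpy as np

    def texture_average(frame_feats, win_sec=5, hop_sec=1, frames_per_sec=2):
        """Average 1-sec/0.5-sec-hop analysis-frame features over 5-sec
        texture windows hopped by 1 sec."""
        w = win_sec * frames_per_sec          # texture window in analysis frames
        h = hop_sec * frames_per_sec          # texture hop in analysis frames
        return np.array([np.mean(frame_feats[i:i + w])
                         for i in range(0, len(frame_feats) - w + 1, h)])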


Figure 6.5: Values of the Energy Fluctuation Rate taan feature for an entire concert of raga Shree performed by artist Kumar Mardur, with the ground-truth taan sections shown as red boxes. Features are plotted with (a) an analysis window of 1 sec and hop of 0.5 sec (no texture window applied) and (b) a texture window of 5 sec with a hop of 1 sec. The distinction of the taan-section feature values is more evident in (b) than in (a).

    6.2.1 Pitch Based Features

The DFT spectrum (128 point) of 1 sec segments of the pitch contour (1 sec = 100 pitch values at 10 ms sampling) is computed using a sliding analysis frame of 1 sec with a hop size of 500 ms. From this DFT, two features are extracted: the Energy Around the Maximum Amplitude in the DFT (EAMA) and the Frequency Corresponding to the Maximum Amplitude in the DFT (FCMA). For the EAMA feature, the 2 bins before and after the maximum-amplitude bin are included (a span of approximately 3.9 Hz), to overcome the bin resolution limitation. With the Maximum Amplitude Value in the DFT (MAV) of an analysis frame given by eq. 6.1, FCMA and EAMA are defined in eq. 6.2 and eq. 6.3 respectively:

    MAV = \max_k |Z(k)|^2                                      (6.1)

    FCMA = f_{MAV}                                             (6.2)

    EAMA = \sum_{k = k_{MAV} - 2}^{k_{MAV} + 2} |Z(k)|^2       (6.3)

where Z(k) is the DFT of the mean-subtracted pitch trajectory z(n), with samples at 10 ms intervals, f_{MAV} is the frequency at which the maximum occurs, and k_{MAV} is the frequency bin closest to f_{MAV} Hz.
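A sketch of eqs. 6.1-6.3 on one analysis frame of the pitch contour follows; the use of a real FFT and the exclusion of the DC bin (which is near zero anyway after mean subtraction) are implementation assumptions.

    import numpy as np

    def pitch_dft_features(pitch_frame, fs=100.0, nfft=128):
        """FCMA and EAMA (eqs. 6.1-6.3) for one 1-sec analysis frame of
        the pitch contour (100 values at a 10 ms hop, i.e. fs = 100 Hz)."""
        z = pitch_frame - np.mean(pitch_frame)   # mean-subtracted trajectory z(n)
        power = np.abs(np.fft.rfft(z, nfft)) ** 2
        k_mav = int(np.argmax(power[1:])) + 1    # peak bin, DC excluded
        fcma = k_mav * fs / nfft                 # eq. 6.2: frequency of the peak
        lo, hi = max(k_mav - 2, 0), min(k_mav + 2, len(power) - 1)
        eama = float(np.sum(power[lo:hi + 1]))   # eq. 6.3: +/-2 bins around peak
        return fcma, eama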


Figure 6.6: (a) Taan frame values for 2 features, plotted as black circles for Bada Khayal and blue crosses for Chota Khayal; (b) Gaussians plotted using the means and variances of the taan features in Bada Khayal and Chota Khayal to visualize their overlap.

    6.2.2 Energy Fluctuation Rate

The energy values corresponding to the vocal melody alone are also available at every 10 ms, along with the pitch. To capture the variations in the energy contour seen in Figure 6.2(b), the mean value is first subtracted from the energy contour within a 1 sec analysis frame, and the number of zero crossings detected is then used as the feature value for that frame. Here as well, we use a hop of 500 ms for the 1 sec analysis frame and average these values over texture frames of 5 sec with a 1 sec hop.
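The zero-crossing count at the analysis-frame level can be sketched as below (texture averaging as above applied afterwards); the names are illustrative.

    import numpy as np

    def energy_fluctuation_rate(energy, frame=100, hop=50):
        """Zero-crossing count of the mean-subtracted energy contour per
        1-sec analysis frame (100 samples at 10 ms), hopped by 0.5 sec."""
        rates = []
        for start in range(0, len(energy) - frame + 1, hop):
            seg = energy[start:start + frame]
            seg = seg - np.mean(seg)                           # remove the mean
            rates.append(int(np.sum(seg[:-1] * seg[1:] < 0)))  # sign changes
        return np.array(rates)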

    6.3 Inspection of Variability of Features with Tempo

As described in section 4.1.1(iii), the tempo increases gradually over the duration of the concert. We want to verify that the proposed features are invariant to this tempo variation, using the data Subset A. Generally, an abrupt change in tempo is seen at the start of the Chota Khayal section, so the starting time of the Chota Khayal can be used for this purpose. Figure 6.6(a) shows the feature values plotted across the data Subset A, with high overlap between the taan feature values of the Bada Khayal and Chota Khayal sections. The mean values of the Bada Khayal and Chota Khayal taan features are -0.0302, -0.0458 and 0.0761, 0.1151 respectively for the scatter plot over the 2 features, while the variances are 0.0325, 0.0291 and 0.0507, 0.0487 respectively. These feature values are close, and their overlap can be visualized in Figure 6.6(b) with the help of Gaussian contours plotted using the above means and variances.

The Euclidean distance between the Bada Khayal and Chota Khayal taan features was calculated, as were the Euclidean distances between taan features within the Bada Khayal section and within the Chota Khayal section. A histogram of these distance values is plotted in Figure 6.7. The mean (0.6) of the distances between the Bada Khayal and Chota Khayal taan feature vectors is comparable with the means of the within-Bada Khayal and within-Chota Khayal distances (0.49 and 0.59 respectively). Thus, both quantitatively and from the figure, it is evident that the feature values are tempo independent.


Figure 6.7: Histogram of Euclidean distances, showing that the distances between feature values from the lower-tempo Bada Khayal section and the higher-tempo Chota Khayal section are comparable with the distances within Bada Khayal and those within Chota Khayal.

    6.4 Classification and Grouping using Posteriors

A frame-wise classification into taan and non-taan styles is carried out for all frames in the vocal segments by a trained MLP network. We use a feed-forward architecture with a sigmoid activation function for the hidden layer, which comprises 300 neurons. Training uses cross-entropy error minimization via the error back-propagation algorithm. We compare the frame-level accuracy of the MLP with an SVM classifier on the data Subset B using leave-one-song-out validation. As seen in Table 6.1, the MLP performs better than the SVM, so we choose to proceed with the MLP. We also report the accuracy of a Deep Belief Network (DBN) with various numbers of hidden layers of 300 neurons: with 2, 3 and 4 hidden layers, the accuracies were 93.54%, 93.29% and 93.45% respectively. The number of neurons was also varied from 100 to 1000 in steps of 50, with no significant change in accuracy. Since the data set is not large, we decided to use fewer neurons and just 1 hidden layer. The precision and recall for taan and non-taan were similar to the MLP results.
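As a rough stand-in for the classifier configuration above (one hidden layer of 300 sigmoid units, cross-entropy training), the setup could be expressed with scikit-learn as below; the solver and remaining hyperparameters here are assumptions, not the thesis's exact back-propagation setup.

    from sklearn.neural_network import MLPClassifier

    # Feed-forward network: one hidden layer of 300 sigmoid units,
    # trained by minimizing cross-entropy (sklearn's log-loss).
    mlp = MLPClassifier(hidden_layer_sizes=(300,), activation="logistic",
                        max_iter=500, random_state=0)
    # mlp.fit(train_feats, train_labels)          # labels: taan = 1, non-taan = 0
    # posteriors = mlp.predict_proba(test_feats)  # per-frame class posteriors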

Upon classification, the recall and precision of taan frame detection with respect to the ground truth serve to measure the discriminative power of the features. In our case, however, we seek to label continuous regions of the audio rendered in taan style, much as a human annotator would. This requires grouping frames based on their homogeneity with respect to the taan characteristics. Novelty detection based on a self-distance matrix (SDM) is an effective way to find segment boundaries [6].

We use a recently proposed approach of computing the SDM from the posterior probabilities derived from the features, rather than from the features themselves [1]. Using posterior probabilities for the SDM yields enhanced homogeneity, owing to the reduced sensitivity to irrelevant local variations. The Euclidean distance between vectors of posteriors is used for calculating the SDM, where the posteriors are the class-conditional probabilities obtained from the MLP classifier for each test input frame.
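The SDM computation itself is then a one-liner over the per-frame posterior vectors; a sketch assuming `posteriors` is an (n_frames, 2) array from the classifier above.

    from scipy.spatial.distance import cdist

    def posterior_sdm(posteriors):
        """sdm[i, j] = Euclidean distance between the posterior vectors
        of frames i and j."""
        return cdist(posteriors, posteriors, metric="euclidean")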


Table 6.1: Comparison of frame-wise accuracies of SVM and MLP, with precision and recall values for each class

    SVM (accuracy 91.74%):
      taan:     precision 0.7018, recall 0.8928
      non-taan: precision 0.9768, recall 0.9224

    MLP (accuracy 93.58%):
      taan:     precision 0.7962, recall 0.8216
      non-taan: precision 0.9586, recall 0.9647

Points of high contrast in the SDM are detected by convolution along the diagonal with a checkerboard kernel [12] whose dimensions depend upon the desired time scale of segmentation. Considering the minimum taan episode duration, this is chosen to be 5 sec, in the interest of obtaining reliable boundaries with few false negatives. The resulting novelty function is searched for peaks, which represent segment boundaries, using the 'local peak local neighborhood' method [7]. Whether the region between two detected boundaries corresponds to a taan is determined by the majority of the frame-level classifications in that region. Finally, the highest level of grouping is obtained by examining the region of audio separating every two detected taan segments. A simple heuristic is set up to mimic the musicians' annotation, as discussed in section 4.3: taan episodes separated by non-taan vocal activity within 10 sec are merged into a single section, and merging is also applied if the separation corresponds to a purely instrumental region of duration within 50 sec.
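The boundary detection and merging steps can be sketched as below. The checkerboard correlation follows Foote-style novelty adapted to a distance matrix, and `find_peaks` stands in for the 'local peak local neighborhood' picker of [7]; both are assumptions about implementation detail, with the 10 sec and 50 sec merge limits taken from the text.

    import numpy as np
    from scipy.signal import find_peaks

    def checkerboard_novelty(sdm, half):
        """Correlate a 2*half x 2*half checkerboard kernel along the SDM
        diagonal; 'half' frames correspond to the 5-sec time scale. On a
        *distance* SDM, a boundary yields large off-diagonal blocks, so
        the kernel is +1 there and -1 on the within-segment blocks."""
        v = np.concatenate([-np.ones(half), np.ones(half)])
        kernel = -np.outer(v, v)
        novelty = np.zeros(len(sdm))
        for t in range(half, len(sdm) - half):
            patch = sdm[t - half:t + half, t - half:t + half]
            novelty[t] = np.sum(kernel * patch)
        return novelty

    def merge_taan_episodes(segments, gap_is_vocal, max_vocal=10.0, max_instr=50.0):
        """Merge consecutive detected taan segments (start, end) in
        seconds when the gap is non-taan vocal activity within 10 sec or
        a purely instrumental region within 50 sec; gap_is_vocal[i]
        describes the gap after segments[i] (assumed to come from SVD)."""
        merged = [list(segments[0])]
        for (start, end), vocal in zip(segments[1:], gap_is_vocal):
            limit = max_vocal if vocal else max_instr
            if start - merged[-1][1] <= limit:
                merged[-1][1] = end        # absorb into the previous episode
            else:
                merged.append([start, end])
        return [tuple(seg) for seg in merged]

    # boundaries, _ = find_peaks(checkerboard_novelty(sdm, half=5), distance=5)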

The intermediate step of frame-wise classification works well using the MLP, as can be seen from the accuracy of the taan v/s non-taan classification. The evaluation of the grouping is described in the next chapter.


Chapter 7

    Experiments and Evaluation

Our ideal system would detect and segment taan sections similarly to a musician's labeling. This high-level task is attempted by the sequence of frame-level automatic classification and higher-level grouping described in section 6.4. We perform two types of experiments: the first uses the smaller, artist-specific data Subset B for deciding the best operating point for the conversion of MLP posteriors into labels; the second applies the parameters decided on the data Subset B to the data Subset D.

We present experimental results on the performance of each of the modules. Frame-level classification is measured by the detection of taan in terms of recall and precision. Artist-dependent and artist-independent training are compared within the 22-concert database. The frame-level classification needs frame-level (i.e. 1 sec resolution) annotation of taan presence or absence, required both for training the classifiers and for reliable testing. The musician labels are not directly usable for this purpose, owing to the presence of non-taan interruptions of significant duration within the musician-labeled taan sections, as seen before. Thus, for the development of the frame-level classifier, we need a finer marking of taan segments. Since this is a demanding task to carry out manually, we use the bootstrapped iterative SAD approach described in section 4.1.1(ii). The SAD evaluation showed that the frame-level labels so obtained were indeed accurate, and these were then used to train and evaluate the frame-level classifiers. The system is also evaluated after grouping, this time in terms of the match between the detected segments and the subjectively labeled taan segments for each concert. We tabulate the taan episode detection by reporting the numbers of over-segmentations, under-segmentations, exact detections, false negatives and false positives.

Conventional measures of performance include cluster purity, pairwise Hamming distance, boundary precision and recall, etc., as detailed in [24]. Any two carefully selected measures give an idea of the type of segmentation achieved; since our task is one of detection, we must additionally report the number of detections. Cluster purity and boundary retrieval were used in [1] to report segmentation evaluation. We also report the cluster purity and boundary retrieval values, as per these standard evaluation measures, in Table 7.1 for completeness, and to emphasize that they are not enough to give a full picture of taan section detection. The high cluster purity values (values close to 1 are good) show that the detected taan and non-taan sections are homogeneous and that there is over- and under-segmentation, but they give no idea of the number of detected taan sections. The same holds for the reported boundary retrieval values: they too reflect the under- and over-segmentation, but the number of sections is not conveyed.

Figure 7.1: Various scenarios that occur after grouping, viz. (a) false positive, (b) over-segmentation, (c) exact detection, (d) false negative, (e) under-segmentation.

Table 7.1: Performance evaluation using conventional measures on (a) the proposed system and (b) the GMM-based system of [1], after grouping at frame level

                       Boundary Retrieval       Cluster Purity
    Method             Precision   Recall       acp      asp      k
    (a) MLP based      0.448       0.578        0.768    0.779    0.773
    (b) GMM based      0.2187      0.4062       0.763    0.621    0.6869

To the best of our knowledge, however, no previous attempt has been made at using supervised classification for labeling the segments, especially for the problem of taan section detection in Hindustani vocal concerts. It is not possible to compare our results with any method other than [1], which works on Dhrupad and instrumental concerts; there, the segments are not labeled, whereas we have modified the approach to perform labeling in the case of taan. Moreover, the conventional measures of performance cannot give a fair idea of how many of the sections marked in the ground truth have been correctly labeled and retrieved. Along with the frame-level accuracy, we therefore measure performance by including the number of correctly retrieved taan segments and the number of false positives. A section is said to be correctly retrieved if at least 50% of its duration overlaps with a detected segment. Also of interest is the extent of over- or under-segmentation of the correctly detected taan sections. Figure 7.1 illustrates the different possibilities of mismatch observed between the subjective labels and the automatically labeled sections. When a subjectively labeled section is correctly detected, the onset and offset boundaries are observed to be always within 5 sec of the corresponding ground-truth boundaries, indicating the reliability of the posterior-based detection of sections.
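The 50% overlap criterion can be stated compactly; a sketch with illustrative names, operating on (start, end) intervals in seconds:

    def correctly_retrieved(gt_section, detected, min_overlap=0.5):
        """True if some detected segment overlaps the ground-truth taan
        section for at least 50% of the ground truth's duration."""
        s, e = gt_section
        return any(min(e, de) - max(s, ds) >= min_overlap * (e - s)
                   for ds, de in detected)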

    7.1 Evaluation on data Subset B

Our audio data Subset B consists of 57 Khayal vocal concert recordings partitioned into two distinct sets: 22 single-artist (Pt. Jasraj) concerts and 35 multi-artist concerts (that do not contain Pt. Jasraj). In both cases a number of different ragas are covered at various tempi, and all artists are male. The 22-concert set is treated as the test set under two different training conditions: artist-specific training via leave-one-song-out cross-validation, and artist-independent training using the 35 multi-artist concerts.

Figure 7.2: Shows (a) the ROC obtained by thresholding the posterior values from the MLP, for the leave-one-song-out case on the 22 concerts and for the 35-train/22-test scenario, and (b) SDM + novelty + grouping for the 35-train/22-test scenario using MLP posteriors; the pink boxes indicate the musician-marked ground truth, the red stars the MLP labels, the black continuous contour the novelty score, the green filled boxes the detected taan regions before grouping, and the circled peaks connected by blue lines the final taan episodes after grouping.