Detection of Taan Sections in Khayal Vocal Concerts
Submitted in partial fulfillment of the requirements
of the degree of
Master of Technology
by
Amruta J. Vidwans
(Roll no. 123076005)
Supervisor:
Prof. Preeti S. Rao
Department of Electrical Engineering
Indian Institute of Technology Bombay
2015
Amruta J. Vidwans/ Prof. Preeti S. Rao (Supervisor): “Detection of Taan Sections in
Khayal Vocal Concerts”, MTech. Degree Dissertation, Department of Electrical Engineer-
ing, Indian Institute of Technology Bombay, July 2015.
Abstract
Structural segmentation of concert audio recordings is very useful for music navigation and automatic summarization. It is particularly relevant for Indian classical music, where concerts can extend for hours and commercial audio recordings rarely provide timing details of the various sections, although the performance typically follows an established structure depending on the genre. The distinct concert sections have contrasting rhythmic, and sometimes melodic, structures. The proposed work is concerned with the automatic segmentation of specific musical sections from the audio of Khayal vocal music concerts. The taan section has a distinct melodic character across concerts, irrespective of the tempo. Our goal is to label the taan sections using acoustic features that capture the melodic style. The features are derived from low-level audio analysis, including pitch and energy tracking of the singing voice.
The proposed system performs binary classification of frames into taan and non-taan classes. The posterior probability vectors obtained in the course of the statistical classification are used in the grouping stage. Smoothing at the higher time scale appropriate to concert section detection is achieved using change detection methods. The grouping stage uses heuristics derived from a study of musicians' annotations of taan episodes on a concert data subset. We evaluate the system in two stages: by its frame-level classification accuracy, and by reporting the number of detected (true, over-segmented and under-segmented), false positive and false negative taan episodes after the grouping stage. We compare the results of our proposed method with an unsupervised segmentation method, showing that the proposed method achieves superior results over a database of 96 concerts in terms of giving fewer false positives.
Index terms: audio segmentation, taan detection, multilayer perceptron (MLP), posterior probability, self-distance matrix (SDM), novelty score, Hindustani Classical music, Khayal vocal concerts
Contents
Dissertation Approval i
Declaration of Authorship ii
Abstract iii
List of Figures vi
List of Tables vii
1 Introduction 1
2 Literature Survey 3
3 Segmentation System Overview 6
3.1 Feature Extraction Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Change Detection and Labeling Block . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Database Description 11
4.1 Database Subsets for Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Database Subset of 32 Concerts . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.2 Database Subset of 96 Concerts . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.3 Database Subset of 22 Truncated Jasraj Concerts . . . . . . . . . . . . . . 13
4.1.4 Musicians’ Annotation: 24 Concert Subset . . . . . . . . . . . . . . . . . . 14
4.2 Khayal Concert Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.1 Khayal Concert Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.2 Taan Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Annotation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5 Pre-processing 22
5.1 Singing Voice Detection (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Obtaining Finer Ground Truth within Taan Section (SAD Method) . . . . . . . 24
6 Taan Section Detection 26
6.1 Pre-processing / Melody Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2.1 Pitch Based Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2.2 Energy Fluctuation Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Inspection of Variability of Features with Tempo . . . . . . . . . . . . . . . . . . 30
6.4 Classification and Grouping using Posteriors . . . . . . . . . . . . . . . . . . . . . 31
7 Experiments and Evaluation 33
7.1 Evaluation on data Subset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2 Experiments on data Subset A . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 Experiments on data Subset D . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8 Conclusion and Future Work 40
Acknowledgements 44
List of Figures
3.1 Simplified Block Diagram for detecting sections in Khayal vocal concert . . . . . 6
3.2 Block Diagram for the proposed system of detecting segments in Khayal vocal
concert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 SDM and Novelty for sitar concert . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1 Possible sequence of sections in Khayal vocal concert . . . . . . . . . . . . . . . . 16
4.2 Rhythmogram of tabla onsets in a Khayal vocal concert . . . . . . . . . . . . . . 16
4.3 Various sections in Khayal vocal concert . . . . . . . . . . . . . . . . . . . . . . . 17
5.1 Algorithm marked Vocal (V) and Instrumental (I) boundaries . . . . . . . . . . . 23
5.2 Example spectrogram of errors in SVD . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Example spectrogram of non-taan movement occurring in taan episode . . . . . 25
6.1 Spectrogram of an akar taan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Pitch and energy contour of taan segment . . . . . . . . . . . . . . . . . . . . . . 27
6.3 Spectrogram of a sargam taan segment . . . . . . . . . . . . . . . . . . . . . . . 27
6.4 Spectrogram of a bol taan segment . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.5 The Energy Fluctuation Rate feature for an audio plotted (a) without texture
window applied (b) with texture window of 5 sec with hop of 1 sec . . . . . . . . 29
6.6 Taan frame feature values for Bada Khayal and Chota Khayal . . . . . . . . . . 30
6.7 Histogram of Euclidean distance between feature values to show tempo invariance
of taan features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.1 Various scenarios that occur after grouping . . . . . . . . . . . . . . . . . . . . . 34
7.2 Shows (a) ROC by thresholding the posterior values obtained from MLP (b)
SDM+novelty+Grouping stages in one of the Subset B audio . . . . . . . . . . . 35
7.3 Comparison of pitch contours obtained from PolyPDA and Melodia plug-in . . . 37
7.4 Comparison of taan episodes detected after SDM+novelty+grouping using pos-
teriors from (a)MLP and (b)GMM . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.5 Errors in taan episodes detection after SDM+novelty+grouping using posteriors
from MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
List of Tables
4.1 Distribution of clips per artist and the gharana of the artists . . . . . . . . . . . 12
4.2 Summary of database subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Khayal concert sections with musical characteristics and acoustic correlates . . . 19
6.1 Comparison of frame-wise accuracies of SVM and MLP with Precision and Recall
values for each class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.1 Performance evaluation using conventional measures on (a) proposed system and
(b) GMM based system of [1] after grouping at frame-level . . . . . . . . . . . . . 34
7.2 Taan detection performance after grouping for 35 train and 22 test concert sce-
nario (92% of taan detected) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 Shows taan detection performance after grouping for (a) pitch extracted from
Melodia plug-in (33% of taan detected) (b) pitch extracted from PolyPDA (82%
of taan detected) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.4 Shows taan detection performance after grouping using (a) MLP on data Subset
D (80% of taan detected)(b) GMM on data Subset D (86% of taan detected) . . 38
7.5 Frame level classification accuracy, precision and recall using MLP on data Subset
D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 1
Introduction
Audio segmentation systems aim at representing the audio at a broader level via labeled musical sections. In the case of Hindustani classical music, there is much variability with respect to the duration of the concert, the number of repetitions of segments, the freedom of improvisation, the school of music and the artist's individual way of rendering the concert. The most popular performance in a Hindustani classical vocal concert is of the Khayal genre. The problem of segmentation in Indian classical music has previously been approached using rhythmic features for tabla concerts and a few Khayal vocal concerts [2], but that work lacks an evaluation of system performance and of structural analysis, as it gives only a visual representation. We know of no papers other than [2] addressing the problem of Khayal concert segmentation. Attempts were made by [3] for Carnatic music, but for the classification of an audio clip into a section rather than segmentation within a concert, with the presence or absence of percussion as an important cue. A review of audio segmentation systems for different music styles is given in Chapter 2.
Holistic segmentation of a Khayal concert is possible by using other dimensions of the audio, i.e. pitch, energy and timbre, along with rhythm. We study the musical and acoustic characteristics of the various Khayal sections and possible features that can be used for them. For certain Khayal sections, the taan section in particular, the nature of the pitch variation is the only cue that can be used to distinguish it from other sections. We explore the possibility of taan detection and labeling in Hindustani classical Khayal vocal concerts by proposing features that distinguish taan from other sections. The main challenge is detection of taan sections at a time scale meaningful to musicians. There should be a mechanism to ignore taan-like movements that may occur in other sections of a Khayal performance. The taan section might occur multiple times in the performance, and we need to retrieve all of its occurrences without inducing false positives. These are a few of the challenges of taan detection and labeling that we wish to address through this work. A general system for detecting sections of Khayal vocal concerts is proposed. Since the taan section can be distinguished using pitch alone, we need an automatic vocal pitch extraction algorithm. Pitch extraction algorithms are not yet fully reliable, with around 80% accuracy achievable. Since pitch extraction of the lead vocal artist is a crucial step in deriving features for detecting sections in Khayal, one of the goals will also be to look into possibilities of improving the accuracy of the current state-of-the-art pitch tracker [4] by improving
one aspect of it, namely the Singing Voice Detection (SVD). The main focus is the study of a section detection and labeling system for Khayal vocal concerts, with the proposed features for the taan section, together with an evaluation of the system. Such a system will be helpful in an audio browser like Dunya [5], which is dedicated to easy access and navigation of Hindustani classical concerts.
It will be useful in fast navigation through Khayal concerts and for music summarization.
The existing segmentation systems are discussed in Chapter 2. There is no audio segmentation system that works on Khayal concerts, but systems that work on different databases and might be applicable to our case are mentioned. We give an overview of the proposed system in Chapter 3. We discuss the database used for our study and describe the particular characteristics of the sections in Khayal concerts in Chapter 4. The database was selected so that all types of taan are present and almost all schools (gharanas) of Hindustani classical music are represented. Pre-processing of the audio, required to obtain a reliable pitch track corresponding to the vocal melody, is discussed in Chapter 5, followed by a description of the features for the taan section of Khayal in Chapter 6. The experiments done at the intermediate and final stages of the system, and a discussion of the results obtained, are presented in Chapter 7, followed by the conclusion in Chapter 8.
Chapter 2
Literature Survey
A broad overview of methods available for structural segmentation is given in [6]. The primary aim of the methods described in that overview paper is to identify the various homogeneous segments within an audio recording and, in some papers, to label the segments as well. The papers reviewed there mainly use features that exploit one or more aspects related to melody, harmony, rhythm and timbre. Only a few papers have used all these aspects of the audio together for segmentation, and the author states that this might be a more effective way of approaching the segmentation problem. The three criteria used to approach the segmentation problem across the papers are repetition, homogeneity and novelty. The homogeneity and novelty based approaches give similar information, with homogeneity based approaches describing the contents within a segment and novelty based approaches describing the boundary between contrasting sections. None of the approaches was seen to be particularly superior to the others, and they have been tried only on Western popular music, where the structure is relatively standard. The author stresses investigating the use of musically motivated features and different distance measures for frame-level feature comparison.
The paper by Turnbull et al. [7] is notable in that it uses all the aspects of the audio: timbre, melody, rhythm and harmony. They use a supervised approach where the features, along with their first and second order differences as well as their smoothed versions, are passed to a boosted decision stump classifier. They train the classifier to label frames as belonging to a boundary or non-boundary class, which is an unconventional way to approach the problem. Providing the information of change in this way was seen to work better in a supervised framework. Also, the final features retained after feature selection were seen to correspond to all the aspects, i.e. rhythm, timbre, melody and harmony, thus stressing the importance of using all the dimensions of music for effective segmentation.
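The front end described above (features plus their first and second order differences and their smoothed versions) can be sketched as follows. This is a minimal numpy illustration; the smoothing length is an assumed parameter, and the boosted decision stump classifier itself is omitted.

```python
import numpy as np

def stack_with_deltas(F, smooth_len=5):
    """Augment a (n_frames, n_dims) feature matrix with its first- and
    second-order differences and a moving-average smoothed version."""
    d1 = np.diff(F, n=1, axis=0, prepend=F[:1])    # first-order difference
    d2 = np.diff(d1, n=1, axis=0, prepend=d1[:1])  # second-order difference
    kernel = np.ones(smooth_len) / smooth_len
    smooth = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0, F)
    return np.hstack([F, d1, d2, smooth])          # (n_frames, 4 * n_dims)

F = np.random.rand(100, 13)   # e.g. 13 MFCCs per frame (illustrative)
X = stack_with_deltas(F)
print(X.shape)                # (100, 52)
```

The stacked matrix would then be passed, frame by frame, to the boundary / non-boundary classifier.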
The paper [8] applies HMM based methods to the music segmentation problem. The author mentions that the method is likely to succeed on recordings made with modern production techniques, particularly where copy-and-paste has been used to produce multiple segments of the same type. The noteworthy aspects are: the use of adaptive hop and window sizes (for feature calculation) depending on the beat interval, and the use of a histogram of states obtained from the Hidden Markov Model (HMM) to determine the labels. The segmentation is performed on Western music audio with sub-band energy in
logarithmically spaced bands as the feature. A beat tracking algorithm is used to track the beat locations, which are used as the hop size, with the frame size also changing dynamically according to the inter-beat interval. This can be useful in Khayal vocal concert segmentation, as the tempo keeps increasing gradually over the concert; we might thus want similar functionality for feature calculation. An HMM with a fairly large number of states is trained on the features. During training, the output of each state is assumed to have a single Gaussian distribution. After obtaining the state probabilities, the Viterbi algorithm is used to maximize the probability of the observed data, and the best state path output by the Viterbi algorithm is used to segment the sections. Each audio waveform is thus transformed into a time series with a specific value at every beat position. The authors clearly depict the importance of the correct choice of the number of HMM states in bringing out specific patterns in the audio. A single HMM state will not correspond to a particular segment in the audio structure; rather, a collection of states contributes to it. Thus, at each of the beat positions, a histogram of the HMM states is computed. Manual segmentation is used to derive reference template histograms, and for each beat location the nearest reference template is then assigned as the label for that beat.
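The state-histogram labeling step can be sketched as below, assuming the Viterbi state sequence has already been obtained from a trained HMM and that reference histogram templates have been derived from manual segmentation. The window size, state sequence and template values here are illustrative only.

```python
import numpy as np

def label_beats(state_seq, templates, n_states, win=8):
    """For each beat position, compute a histogram of HMM states over a
    local window and assign the label of the nearest reference template
    (templates derived from manually segmented recordings)."""
    labels = []
    for t in range(len(state_seq)):
        lo, hi = max(0, t - win), min(len(state_seq), t + win + 1)
        hist = np.bincount(state_seq[lo:hi], minlength=n_states).astype(float)
        hist /= hist.sum()
        # nearest template by Euclidean distance between histograms
        dists = {name: np.linalg.norm(hist - ref)
                 for name, ref in templates.items()}
        labels.append(min(dists, key=dists.get))
    return labels

# Toy example: states 0-1 dominate section "A", states 2-3 dominate "B"
seq = np.array([0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3])
templates = {"A": np.array([0.5, 0.5, 0.0, 0.0]),
             "B": np.array([0.0, 0.0, 0.5, 0.5])}
print(label_beats(seq, templates, n_states=4))
```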
Another paper that uses various dimensions of the audio, namely timbre, rhythm, harmony and energy via MFCC, rhythmogram, chroma and short-time energy representations, is [9]. This paper emphasizes the performance improvement obtained in segmentation when multi-track audio data is available. The main idea is that each of the features may be affected in polyphonic recordings in ways that generally do not occur in separate per-instrument recordings. They also start with beat tracking algorithms and then assign a feature vector to each beat position. Multiple features are extracted for each of the tracks: 13 MFCCs, a 12-dimensional chroma vector, 200 rhythmogram based and 1 RMS energy based feature. For each track we thus get 226 features, and for each sub-category of features an SDM is computed separately. Each of these SDMs is convolved with an appropriate kernel to obtain the novelty score. In order to get the best performance, each of the computed SDMs is given a weight of 1, 10 or 100. A final SDM is obtained by addition of the weighted individual SDMs, and its performance is computed for different combinations of weights. For a particular genre, the weights which give the best performance are chosen as the final model.
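The weighted combination of per-feature-group SDMs can be sketched as below. The scoring function used to evaluate a weight combination (boundary accuracy against ground truth in [9]) is left abstract here, and all names are illustrative.

```python
import numpy as np
from itertools import product

def sdm(F):
    """Euclidean self-distance matrix of a (n_frames, n_dims) feature sequence."""
    diff = F[:, None, :] - F[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def combined_sdm(feature_groups, weights):
    """Weighted sum of per-group SDMs (groups e.g. MFCC, chroma, rhythmogram, RMS)."""
    return sum(w * sdm(F) for w, F in zip(weights, feature_groups))

def best_weights(feature_groups, score_fn):
    """Grid search over weights {1, 10, 100} per group; score_fn evaluates
    a combined SDM (e.g. boundary F-measure against ground truth)."""
    candidates = product([1, 10, 100], repeat=len(feature_groups))
    return max(candidates,
               key=lambda w: score_fn(combined_sdm(feature_groups, w)))
```

For two feature groups this searches 9 weight combinations; with the four groups of [9] it would be 81.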
In the case of Hindustani classical music, the segmentation problem was approached using rhythm as a feature by [2] for Khayal vocal concerts as well as tabla concerts. In Hindustani music, changes in rhythm can be perceived when the tabla performer improvises away from the basic rhythmic cycle or when the rhythmic cycle itself changes. Tabla onsets are detected by processing the band in which tabla strokes are prominent with an onset detection function, thus giving a representation of rhythm. The visual representation of this feature using a self-distance matrix reveals the rhythmic structure of a tabla solo concert, in which the tabla is the lead instrument, and gives the boundaries of the major segments of a Khayal vocal concert. Spectral features combined with an auditory-processing-motivated bi-phasic function achieved good time localization of onsets for polyphonic audio. Changes in the self-distance matrix are detected by correlating a checkerboard kernel, of a width suited to the desired temporal resolution, along its diagonal to obtain a novelty score depicting the changes. Here, however, an evaluation of the system was not presented, and it was mentioned in the paper that the subsections within Khayal vocal concerts cannot be segmented using rhythm alone.
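As a simplified stand-in for the band-limited onset detection described above (the bi-phasic function of [2] is not reproduced here), a half-wave rectified spectral flux restricted to a sub-band can be sketched as follows; the band limits and STFT parameters are illustrative assumptions, not the values used in [2].

```python
import numpy as np

def band_flux(x, sr, fmin=0, fmax=1000, n_fft=1024, hop=256):
    """Half-wave-rectified spectral flux summed over [fmin, fmax] Hz:
    a simple onset detection function emphasizing a chosen sub-band."""
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    mags = np.array([np.abs(np.fft.rfft(win * x[i * hop:i * hop + n_fft]))
                     for i in range(n_frames)])
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    flux = np.diff(mags[:, band], axis=0)       # frame-to-frame increase
    return np.maximum(flux, 0).sum(axis=1)      # one value per frame hop

# Toy usage: a noise burst in silence produces a flux peak at its onset
sr = 16000
x = np.zeros(sr)
x[8000:8256] = np.random.default_rng(0).standard_normal(256)
odf = band_flux(x, sr)
```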
Verma et al. [1] have worked with Hindustani instrumental concerts and Dhrupad vocal performances and have combined various features dealing with different musical aspects of the audio, such as chroma, tempo and energy based features. They have proposed converting features into their posterior probabilities to make them robust to local variations. A limitation of their work is that their dataset has little to no variability in the number of sections, and skipping or repetition of sections is not seen in the concerts.
Ranjani et al. [3] carried out classification of Carnatic audio clips into their musical forms. They do not perform segmentation, but rather label an audio clip which represents a segment. Hierarchical classification was used to label the segment to which the given audio belongs, using the absence or presence of rhythm as a major cue. This can be used in the Khayal vocal concert segmentation task after the boundaries are obtained, to arrive at the final label names using the feature value distributions.
Since our task involves the detection and segmentation of a specific named section of the concert, we need to invoke both segmentation and supervised classification methods. Musically motivated features and methods are our chosen approach, given their potential for success with limited training data [10]. The challenges for taan detection are the polyphonic setting, where we want to focus on the vocal signal, and the design of distinctive features that are artist and concert independent. Given that pitch modulations are the prime characteristic of taan, reliable pitch detection with sufficient time resolution is necessary. Finally, we need to convert the low-level analyses to an annotation that closely matches the musicians' labeling of taan episodes from a performer's point of view. Towards these goals, we use a vocal source separation algorithm based on predominant-F0 detection [4]. Features designed to capture the characteristic rapid but regular pitch and energy variations of the voice are presented. A frame-level classification at 1 sec granularity is followed by a grouping stage with the goal of emulating the subjective labeling of taan by musicians as extended regions that occur at salient positions in the concert.
Chapter 3
Segmentation System Overview
Figure 3.1: Simplified Block Diagram for detecting sections in Khayal vocal concert
We want to segment out the different sections in a Khayal vocal concert using their acoustic characteristics, as discussed before. The final aim is to get correct section boundaries along with their labels. For that we propose the modules seen in the block diagram of Figure 3.1, consisting of two main blocks, viz. the feature extraction block and the change detection and labeling block. The feature extraction block deals with the observation of acoustic characteristics and with using existing features, or devising new ones, that bring out the distinction between the various Khayal vocal concert segments. The pre-processing block processes the audio into a simplified form on which feature extraction can be performed. For example, if features are to be extracted on the vocal pitch alone, then pre-processing performs vocal pitch extraction; if rhythm related features are needed, then a sub-band signal with enhanced tabla strokes is provided. The change detection and labeling block provides the means to arrive at boundaries between sections using the existing framework of the self-distance matrix (SDM) and subsequent modules that seem suitable for the problem. The challenging aspect is to come up with section names and to detect their recurrence. The detailed block diagram is given in Figure 3.2.
Figure 3.2: Block Diagram for the proposed system of detecting segments in Khayal vocal concert
3.1 Feature Extraction Block
This block includes a pre-processing step followed by a feature extraction step. Each is explained in the following sections.
3.1.1 Pre-processing
Since the vocal pitch is the primary cue for various sections, we need to extract it reliably in order to compute features on it. Tabla onsets might also be a cue for broad-level sectioning into the Alap, Bada Khayal and Chota Khayal sections, so we need voice and instrument sound isolation to extract features on each separately. The pitch extraction of the dominant source is first carried out using the technique of [4]. The assumption is that the vocal melody, being the lead, has the strongest contribution to the spectrum. Thus, by extracting the predominant F0, we expect to obtain the melody corresponding to the voice even where both the voice and a Secondary Melodic Instrument (SMI) are present. This is further used in the dominant source spectrum isolation block to isolate the dominant source spectrum by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained in each analysis frame by searching in the vicinity of multiples of the detected pitch.
Features are extracted on the source-isolated spectral envelopes, as they are less dependent on the pitch of the source and represent the source timbre better. The feature extraction block uses static and dynamic timbral features as well as dynamic F0-harmonic features, as in [11]. Using only features selected for the genre under consideration, i.e. Hindustani classical music, and applying an unsupervised classification algorithm such as k-means clustering, labels are obtained for the frames as belonging to either voice or instrumental regions. An unsupervised algorithm is used because the acoustic characteristics are clearly different for voice and instrument regions within an audio recording. From here we can extract the pitch corresponding to the vocal regions alone and pass it to the feature extraction block.
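The 2-class clustering step can be sketched with a minimal k-means as below. The actual features of [11] and the heuristic needed to decide which cluster corresponds to the voice are not shown, and the deterministic initialization is an illustrative choice.

```python
import numpy as np

def kmeans_2class(X, n_iter=50):
    """Minimal 2-class k-means over per-frame feature vectors X (n_frames, n_dims).
    Centers are initialized deterministically from two extreme frames."""
    centers = X[[np.argmin(X.sum(axis=1)), np.argmax(X.sum(axis=1))]].astype(float)
    for _ in range(n_iter):
        # assign each frame to its nearest center, then update the centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign

# Toy usage: two well-separated synthetic "timbre" clusters stand in for
# instrument-only and voice frames
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 4)), rng.normal(1.0, 0.1, (50, 4))])
frame_labels = kmeans_2class(X)
```

A cluster-to-label mapping (which cluster index is "voice") would still be needed, e.g. based on the typical range of the F0-harmonic dynamic features.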
The accuracy of Singing Voice Detection (SVD) using the unsupervised k-means clustering algorithm with genre specific features was seen to be 83% in a previous study. Further experimentation and evaluation on the database considered in this thesis is explained in Section 4.2. The pitch extraction algorithm of [4] is superior to others for the case of Indian classical music, with only minor occasional octave errors. Any other errors observed during the course of the work also need to be taken care of, to improve the pitch extraction so that reliable feature extraction is possible for the detection of sections.
3.1.2 Feature Extraction
We can derive features for the detection of sections in a Khayal vocal concert using the extracted vocal pitch, or the source-isolated spectrum belonging to the voice alone, whichever is suitable given the acoustic characteristics of the Khayal sections concerned. Other features based on the timbre or the rhythmic onsets of the tabla can be derived directly from the audio, with pre-processing as required. The acoustic characteristics and possible features for each section are discussed in section 4.2. Since we have chosen the features to be representative of a musical section, we expect them to be homogeneous within sections while contrasting across sections. This contrast among feature values across sections can be captured by passing the features to the change detection and labeling block.
3.2 Change Detection and Labeling Block
This block is generic and can be applied to any set of features irrespective of the style of music, as the motive of the block is to detect changes between contrasting sections; only the labeling part might require some modifications. The first component in the change detection and labeling block is classification using the features. The posterior probability values obtained from the classification can be used for the computation of the Self-Distance Matrix (SDM) and the novelty score. The SDM can be computed on the features directly or after transforming them into their posterior probability values. This transformation was seen to be effective for the segmentation of Dhrupad vocal and instrumental alap audio [1], as seen in Figure 3.3, and hence can be investigated for use on Khayal vocal concerts as well. In the case of [1], the sections to be identified are fixed in number and occur sequentially, hence labeling the sections was not a major problem. They used unsupervised GMM classification to convert the features into posterior probability values, with the number of classes in the classifier set equal to the number of sections. In the case of detection of sections in a Khayal vocal concert, though the number of sections is fixed, they do not occur sequentially and generally repeat over the concert in a way that cannot be generalized. Also, we finally need to label the sections, which will not be straightforward if we use unsupervised classification methods. Hence, the preferable way in this case is to convert the features into posteriors using a supervised classification algorithm and use the
labels obtained for further post-processing. An SDM is computed on the features / posteriors as D(i, j) = d(x_i, x_j) for i, j ∈ {1, 2, ..., N}, where the distance function d specifies the distance between two frames x_i and x_j. Typically used distance measures are the Euclidean distance or the element-wise dot product [12]. The SDM provides a visualization of how distinct the features are in different sections and of which sections repeat. As seen in Figure 3.3(b), 3 blocks corresponding to alap-jod-jhala are clearly seen in the SDM.
Points of high contrast in the SDM can be detected by convolution along the diagonal with a checkerboard kernel [12]. The dimensions of the kernel have to be chosen according to the time scale of the sections to be detected. This convolution results in a one-dimensional function called the novelty score, whose peaks indicate the points of contrast, which, given suitable feature selection, should lie at the locations of section boundaries. The peaks in the novelty score might be closely spaced and spurious. The detection of peaks is therefore done in the peak detection block using the 'local peak, local neighbourhood' search proposed in [7]. Even after this processing, the novelty score can still have multiple peaks, resulting in over-segmentation. At this point we know the labels, and hence we can name the sections between two peaks. Multiple sections can be further merged in the post-processing block using rules derived from musicians' annotations, for example by observing the maximum gap between two sections of the same type that the musicians ignored when merging sections.
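The SDM, checkerboard-kernel novelty score and local-neighbourhood peak picking described above can be sketched as follows; the kernel width, Gaussian taper and neighbourhood size are illustrative choices, not the values used in this thesis.

```python
import numpy as np

def checkerboard(L):
    """2L x 2L checkerboard kernel for a *distance* matrix: cross-section
    quadrants are +1 (large cross-distances at a boundary raise the score),
    same-section quadrants are -1, with a Gaussian taper toward the edges."""
    sign = np.kron(np.array([[-1.0, 1.0], [1.0, -1.0]]), np.ones((L, L)))
    t = np.arange(-L + 0.5, L)
    g = np.exp(-(t / (0.5 * L)) ** 2)
    return sign * np.outer(g, g)

def novelty_score(D, L):
    """Correlate the checkerboard kernel along the main diagonal of SDM D."""
    K, N = checkerboard(L), len(D)
    nov = np.zeros(N)
    for i in range(L, N - L):
        nov[i] = np.sum(K * D[i - L:i + L, i - L:i + L])
    return nov

def pick_peaks(nov, half_win=10):
    """Keep only positive peaks that are maxima over a local neighbourhood
    (in the spirit of the 'local peak, local neighbourhood' search of [7])."""
    return [i for i in range(1, len(nov) - 1)
            if nov[i] > 0
            and nov[i] == nov[max(0, i - half_win):i + half_win + 1].max()]

# Toy example: two homogeneous 30-frame sections with contrasting feature values
f = np.concatenate([np.zeros(30), np.ones(30)])
D = np.abs(f[:, None] - f[None, :])        # SDM of a 1-D feature sequence
peaks = pick_peaks(novelty_score(D, L=8))
print(peaks)                               # a single boundary at frame 30
```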
Figure 3.3: SDM (top) and novelty functions (bottom) for a sitar concert computed from (a) acoustic feature vectors (b) posterior feature vectors, as depicted in [1]
The final peaks thus obtained represent the boundaries of the sections, and the regions between them have labels in the case of supervised classification. With the unsupervised classification proposed in [1], on the other hand, we would still have to determine the labels for the sections as well as detect their repetitions. Since the features are musically derived, we can make use of the feature values along with the boundary points to come up with a rule-based decision or a hierarchical classification, as done by [3]. Another approach could be to use a Hidden Markov Model (HMM) to detect repetition, but at the level of sections rather than at the frame level [13]. While image processing based methods to detect repetitions have been reported in [6], the method in [3] seems attractive due to its simplicity.
Chapter 4
Database Description
We have 102 concert audio recordings of 24 artists from commercially available CDs, spanning a number of gharanas of vocal music. The audio files were stored at a 16 kHz sampling rate with 16-bit resolution in single channel format. The total duration of the recordings is 51 hours, with 66 ragas covered. The longest concert audio is 1 hr 3 min, the shortest is 15 min, and the average duration is 30 min. The durations of the concert sections in the audio thus vary considerably. The number of recordings per artist and their gharanas can be seen in Table 4.1. As can be seen, several artists belong to more than one gharana. The database was chosen such that almost all the gharanas were covered. It was observed that all the artists in the database render akar taan. To keep the overall percentages of bol taan and sargam taan on par with akar taan, more concerts of the artists Jasraj, Ajoy Chakraborty and Kishori Amonkar were included, as they were seen to take sargam and bol taan. Despite this, the percentage of sargam taan over all the concerts is just 1.6%, while the percentage of akar taan is 13.3% and that of bol taan is 3%.
The audios were selected such that they contain all the major sections, i.e. the unmetered alap, the Bada Khayal, the Chota Khayal and their improvisation sections. These divisions within the Khayal performance are explained in detail in Section 4.2. As discussed earlier, artists may skip sections or choose the tempo of the performance depending on time constraints. Generally, a taan section marks the end of both the Bada Khayal and the Chota Khayal in any performance, but an artist might skip the taan section after the Bada Khayal and render it only at the end of the Chota Khayal. This was observed particularly in the shorter audios of 15-20 min duration.
4.1 Database Subsets for Evaluation
Detailed experimentation has been performed using different specialized subsets of the database. The subsets created and their purposes are described below, with a summary in Table 4.2.
Table 4.1: Distribution of clips per artist and the gharana of the artists
Artist No. of Clips Gharana
Aarti Ankalikar — 2 — Agra, Gwalior, Atrauli
Ashwini Bhide — 1 — Jaipur-Atrauli
Ajoy Chakraborti — 12 — Patiala
Aslam Khan — 4 — Agra
Dattatreya Velankar — 2 — Gwalior, Kirana
Girija Devi — 4 — Banaras
Gauri Pathare — 2 — Jaipur, Gwalior, Kirana
Hirabai Barodekar — 4 — Kirana
Jasraj — 24 — Mewati
Jitendra Abhisheki — 1 — Agra, Jaipur
Jayateerth Mevundi — 1 — Kirana
Jagdish Prasad — 4 — Patiala
Kishori Amonkar — 9 — Jaipur, Bhendi Bazar
Kaushiki Chakraborti — 4 — Patiala
Kaivalyakumar Gurav — 2 — Kirana
Kumar Mardur — 2 — Kirana
Manik Bhide — 2 — Jaipur-Atrauli
Mani Prasid — 2 — Kirana
Malini Rajurkar — 3 — Gwalior
Prabha Atre — 10 — Kirana
Prabhakar Karekar — 2 — Agra, Gwalior
Raghunandan Panshikar — 1 — Jaipur
Ulhas Kashalkar — 2 — Gwalior, Jaipur, Agra
Veena Sahastrabuddhe — 2 — Gwalior, Jaipur, Kirana
4.1.1 Database Subset of 32 Concerts
This subset contains an almost equal mix of male and female artists, and different ragas, drawn from the 102-concert database. The total duration of the 32 concerts is 17 hrs. This subset is needed for the evaluation of three tasks that must be performed before proceeding to taan detection:
i) Singing Voice Detection (SVD): Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. To extract the melody, we first need to identify the regions where vocal melody is present. This can be done using the SVD features proposed in [11], considered the state of the art for Hindustani classical music. We need to evaluate the performance of the SVD algorithm and look into possible improvements. This subset has hence been annotated for vocal and instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, of which 12 hrs are vocal regions. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.
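The comparison with the annotated ground truth can be sketched as a frame-wise accuracy computation; the 10 ms hop and the (start, end, label) interval format are illustrative assumptions:

```python
def frames_from_intervals(intervals, total_dur, hop=0.01):
    """Convert (start_sec, end_sec, label) annotations into per-frame labels.

    Frames not covered by any annotated interval default to 'I'
    (instrumental); the hop size is an illustrative choice.
    """
    n = int(round(total_dur / hop))
    labels = ['I'] * n
    for start, end, lab in intervals:
        for i in range(int(start / hop), min(int(end / hop), n)):
            labels[i] = lab
    return labels

def framewise_accuracy(predicted, ground_truth):
    """Fraction of frames on which the SVD output matches the annotation."""
    n = min(len(predicted), len(ground_truth))
    return sum(p == g for p, g in zip(predicted, ground_truth)) / n
```

The same frame-level comparison applies to the SAD evaluation described next.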
ii) Obtaining finer ground truth within the taan section (SAD evaluation): Within a taan episode, some non-taan movements also occur between the taan movements, as seen in Figure 5.3. For the initial evaluation, we do not want the classifier accuracy to be affected by these non-taan regions occurring within the taan section. To facilitate this, finer frame-level markings within the taan section were obtained automatically using the Speech Activity Detector (SAD) [14]. The evaluation of the SAD was done on this subset.
iii) Inspection of tempo invariance of features: As described in the previous chapter, the tempo increases gradually over the duration of a concert. We want to inspect whether the proposed features are invariant to this tempo variation. Generally, an abrupt change in tempo occurs at the start of the Chota Khayal section, so the timing of this starting point can be used for this purpose. This is addressed in detail in Section 6.3.
4.1.2 Database Subset of 96 Concerts
Among the 102 concerts, 32 were evaluated for SVD accuracy. Nonetheless, the taan sections in all the concerts were checked to verify that pitch was not missing over major parts of the taan sections. For 6 of the 102 concerts, the pitch contour was found to be absent even though the vocal melody was present in the audio. A detailed analysis of this is presented in Section 5.1.
4.1.3 Database Subset of 22 Truncated Jasraj Concerts
From the 96-concert database, 22 concerts belonging to the artist Jasraj were used initially to observe the performance of the supervised classification algorithm for taan section detection. The algorithm was subsequently tested on the larger database of 96 concerts. This subset was created to ease debugging of the various steps involved in section detection and to obtain the optimal operating point to be used on the larger database.
4.1.4 Musicians’ Annotation: 24 Concert Subset
The end goal of this work is to mark taan episodes that are meaningful to musicians. It is therefore important to obtain taan section markings from different musicians and compare them to uncover the logic behind their annotation. The differences were mainly in the marking of taan episodes, in terms of the allowable vocal non-taan regions between consecutive taan sections or the instrumental improvisation between them. This can be quantified, by examining the region of audio separating every two taan segments in the musicians' annotations, to obtain a simple heuristic that achieves the highest level of grouping of taan sections. For this purpose, one concert per artist was chosen to put together a subset of 24 concerts. Taan boundary markings were obtained from 3 musicians and observations were made on them. In general, across all the markings, the mukhada occurring at the end of a rhythmic cycle was combined into the taan episode. At most 10 sec of vocal duration corresponding to non-taan was combined into a taan episode, and instrumental improvisation occurring between rhythmic cycles was considered part of the taan if its duration was less than 50 sec. These insights were used to combine consecutive taan segments.
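These grouping heuristics can be sketched as follows, under the assumption that each detected segment carries start/end times in seconds and that the gap between consecutive segments has been labelled as vocal or instrumental (the representation is illustrative):

```python
def merge_taan_segments(segments, gaps, vocal_gap=10.0, instr_gap=50.0):
    """Merge consecutive detected taan segments into taan episodes.

    `segments` is a list of (start, end) times; `gaps[i]` labels the region
    between segment i and i+1 as 'vocal' or 'instrumental'.  The thresholds
    follow the musicians' annotations: up to 10 s of non-taan vocals or
    50 s of instrumental improvisation is subsumed into the episode.
    """
    if not segments:
        return []
    merged = [list(segments[0])]
    for (start, end), gap_label in zip(segments[1:], gaps):
        gap = start - merged[-1][1]
        limit = vocal_gap if gap_label == 'vocal' else instr_gap
        if gap <= limit:
            merged[-1][1] = end            # absorb the gap into the episode
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]
```

A gap exceeding its threshold starts a new episode; everything else is grouped, mirroring the musicians' behaviour.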
Table 4.2: Summary of database subsets
Subset A: 32 concerts. Purpose: SVD and SAD evaluation; proving tempo invariance of the features; comparison of taan detection performance using different pitch detection algorithms.

Subset B: 22 Jasraj concerts (testing) + 35 other male artists' concerts (training). Purpose: truncated concerts used for finding the f-measure; initial testing of the system.

Subset C: 24 concerts (1 concert per artist). Purpose: musicians' annotation for obtaining grouping heuristics.

Subset D: 96 concerts. Purpose: evaluation over the entire dataset; comparison of the proposed method with [1].
4.2 Khayal Concert Details
The Khayal concert is the most popular form in Hindustani classical vocal music and is based on the theme of a raga. A raga is not just a scale of allowed notes but also has motifs associated with it. The vocal artist uses the raga as the theme of the performance and is accompanied by the tabla for rhythm, the harmonium or sarangi for melodic accompaniment, and the tanpura as a reference to the tonic. The role of the melodic accompaniment is to follow the main melody, while the role of the tabla is timekeeping. The vocal artist is the main performer of the concert and has the liberty to give the tabla and harmonium artists a few minutes in the concert for showing their skills at improvisation.
4.2.1 Khayal Concert Structure
The aim of Khayal is to elaborate on the idea of the raga via motifs and note-by-note elaboration as perceived by the performing artist. In a typical Khayal vocal concert, the artist chooses a raga and starts introducing it through motifs in the alap. This introductory alap is not accompanied by the tabla, i.e. the percussive accompaniment of the concert. The alap lasts at least half a minute and may extend beyond 10 min, depending on the gharana, i.e. the school of the artist. This is followed by the Bada Khayal composition and its improvisation section, where the tabla sets in with a particular tala at a slow (vilambit laya) or medium (Madhya laya) tempo, as suits the bandish, i.e. the composition selected by the artist. The bandish generally comprises four lines of poetry. At the start of the Bada Khayal, the artist renders the first two lines of the composition, called the Sthayi, once or twice. The Sthayi is limited to the middle and lower registers [15].
After this, the artist starts the improvisation of the Bada Khayal by first taking up the alap, which can be rendered using the lyrics of the composition or the vowel /a/. The next two lines of the composition, called the Antara, have their melody in the second part of the middle octave and higher [15]. Hence, depending on the raga elaboration followed by the artist, the Antara can be taken up when the artist is near those notes. After the alap, the artist plays with the rhythm and melody in the baat section, which can again be rendered using the lyrics of the composition (bol), note names (sargam) or the vowel /a/ (akar). This section may be present or absent depending on the artist. After the baat follows the taan section, which again can be rendered using the lyrics, note names or the vowel /a/. The taan section marks the end of the Bada Khayal improvisation. An artist may sometimes prefer not to take a taan section in the Bada Khayal improvisation, but that is rare. The concert then takes on a faster tempo relative to the Bada Khayal by moving on to the Chota Khayal composition and its improvisation sections, which are the same as for the Bada Khayal, i.e. the alap, baat and taan. Often, since the rhythm has entered a fast tempo, artists prefer to skip the alap and baat sections after rendering the Chota Khayal composition and take the taan directly. Artists also often prefer to take a tarana instead of a Chota Khayal, but the improvisation sections remain the same as those of the Chota Khayal. The tarana has no apparently meaningful lyrics and uses words like 'dirdir', 'tanana', 'dim', 'tom', etc., as well as tabla bols. The individual acoustic and musical characteristics of the various Khayal vocal concert sections are listed in Table 4.3 and their hierarchy is depicted in Figure 4.3. The sequence of the different sections in an entire concert of 27 min can be seen in Figure 4.1.
Figure 4.2 shows a rhythmogram, a two-dimensional time-pulse representation with lag time on the y-axis, time position on the x-axis and the autocorrelation values of onsets visualized as intensity. The autocorrelation peaks give an idea of the tempo in the concert: as the peaks get closer, the tempo can be interpreted as increasing. Throughout the concert, the tempo is seen to increase gradually within the Bada Khayal, with an abrupt increase in tempo (possibly also a change of tala and raga) for the Chota Khayal, as seen in Figure 4.2. The tempo of the Bada Khayal is not fixed to a particular value across concerts: many artists take tempi as slow as 10 bpm or as fast as 40 bpm, which might seem like Madhya laya [16], while the drut laya ranges from 160-320 bpm. The improvisation sections may not follow any particular order, but the above-mentioned sequence
Figure 4.1: Approximate sequence of different sections in a Khayal vocal performance of raga Shree by artist Kumar Mardur

Figure 4.2: Rhythmogram of tabla onsets in a raga Deshkar Khayal vocal performance by artist Kishori Amonkar, as depicted in [2]
is a general trend. The melody improvisation is gradual and note by note with the ‘mukhada’
marking the end of a rhythmic cycle. The artist generally starts with improvising in the lower
octave, then the middle octave and then advancing to the upper octave. Melodic movements do
not span multiple octaves within a rhythmic cycle with the exception being the taan section,
where the artist tries to show off his mastery over the voice. According to [17], the various sections do not fall into rigid divisions, and there may be occasional overlap between them. Since the sections differ in proportion, placement and quality of impact according to the musical context, one section cannot be mistaken for another.
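The rhythmogram referred to in this section can be sketched as a short-time autocorrelation of an onset-strength envelope; the window, hop and maximum-lag values below are illustrative choices, not the settings of [2]:

```python
import numpy as np

def rhythmogram(onset_strength, frame_rate, win_sec=4.0, hop_sec=1.0, max_lag_sec=2.0):
    """Short-time autocorrelation of an onset-strength envelope.

    Each column holds the autocorrelation (over lags up to `max_lag_sec`)
    of one analysis window, so rows correspond to lag time and columns to
    concert time, matching the layout described for Figure 4.2.
    """
    win = int(win_sec * frame_rate)
    hop = int(hop_sec * frame_rate)
    max_lag = int(max_lag_sec * frame_rate)
    cols = []
    for start in range(0, len(onset_strength) - win + 1, hop):
        seg = onset_strength[start:start + win]
        seg = seg - seg.mean()
        ac = np.correlate(seg, seg, mode='full')[win - 1:win - 1 + max_lag]
        cols.append(ac / (ac[0] + 1e-12))  # normalise by zero-lag energy
    return np.array(cols).T                # shape: (lags, time frames)
```

For a steady pulse, the strongest non-zero-lag peak sits at the inter-onset interval, so a shrinking peak lag over time reads as an increasing tempo.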
4.2.2 Taan Section
In this work, our focus is on segmenting taan sections that are melodically salient i.e. the
sequence of melodic phrases or notes is rendered in a characteristic melodic style. The notes
may be articulated in various ways including solfege (sargam taan) and the syllables of the lyrics
(bol taan). Most common, however, is the akar taan, rendered using only the vowel /a/ (i.e., as melisma).

Figure 4.3: The various sections in a Khayal concert are depicted, with the sequence being the alap without percussion followed by the Bada Khayal and its improvisation components

The sequence of notes is relatively fast-paced and regular, produced as skillfully controlled pitch and energy modulations of the singer's voice, similar to vibrato. But unlike the
use of vibrato which ornaments a single pitch position in Western music, the cascading notes of
the taan sketch elaborate melodic contours like ascents and descents over several semitones. The
melodic structure is strictly within the raga grammar while the step-like regularity in timing
brings in a rhythmic element to the improvisation in contrast to the (also improvised) alap
sections. Apart from showcasing the singer’s musical skills, one or more taan sections typically
contribute to the climax of a raga performance and therefore serve as prominent musicological
markers. These unique characteristics of the taan section motivate us to investigate it further
for detection of this section. Table 4.3 lists the musical and acoustic characteristics of the taan section. Not all concerts contain all the taan types: the akar taan was seen to be rendered by artists across all the schools of music, while the sargam taan was the least rendered type. Relative to total concert duration, the taan percentage within a concert ranges from a minimum of 3.9% to a maximum of 38%; on average, 18% of a concert is taan section. The minimum duration of a taan section across all the concerts is 5.6 sec. The methodology used to annotate the taan sections is described in the next section.
4.3 Annotation Methodology
Keeping in mind the musical characteristics mentioned in table 4.3 as also the acoustic charac-
teristics we can mark the section in Praat [18] giving us the time instances of the start and end
time of the section. Before starting with the annotation, the musicians were explained the need
for annotation of taan section i.e. for easy navigation in a concert to go to a particular section.
The musicians were asked to annotate the taans in similar way as though they were pointing
out to their students the locations of taan episodes for their study of taan . Taan like movement
might occur in other sections for short intervals as well, but does not form the taan section.
Also, the taan section can occur multiple times, and the artist has the liberty to render it anywhere in the concert. Multiple passes near a boundary were thus allowed before finalizing it as a
valid boundary. The rhythmic cycle is important here, as the tempo is slow in the initial part of the concert and fast in the later part. If the majority of a rhythmic cycle contains taan-like melodic movements, followed by more such movements over subsequent cycles, it was marked as a taan section. A taan section thus comprises a cluster of taan-like movements occurring over multiple cycles. The end of the section is marked where a different homogeneous section, other than the taan episode, starts.
Table 4.3: Khayal concert sections with musical characteristics and acoustic correlates

Unmetered Alap
Musical characteristics: Tabla: (a) absent; an unmetered, non-pulsated section. Vocal melody: (b) introduction of the raga, a slow rendition of raga motifs with more steady notes. (c) Depending on the nature of the raga, the artist generally starts from the middle-octave Sa, moves to the lower octave, returns to the middle octave, goes up to the higher octave, and again ends on the middle-octave Sa. (d) Rendered only with the vowel /a/.
Acoustic characteristics: (a) The wideband event of a tabla hit is absent, as no percussion is present; only the voice of the lead artist and the accompaniment are present. (b) Long steady pitches at frequencies corresponding to the note locations of the raga. (c) The evolution of pitch values over the alap can be observed. (d) Harmonics at formant locations corresponding to the vowel /a/ appear dark.
Features: (a) Absence of tempo in the rhythmogram of tabla onsets. (b) Steady-note measure on pitch values if the rhythmogram is not sufficient. (c) A short-time histogram can show the note being emphasized in each time interval and the trend over time. (d) Spectral centroid.

Bada Khayal
Musical characteristics: Tabla: (a) percussion sets in. Vocal melody: (b) distinctive features of the raga are displayed through a composition having two parts, viz. sthayi and antara. (c) Usually the first line of the sthayi, called the 'mukhada', serves as a recurring theme in the performance and gives a cue for the 1st beat of the rhythm cycle. (d) Sthayi: melody in the 1st part of the middle octave and part of the lower octave. Antara: melody in the 2nd part of the middle octave to the upper octave and beyond. Tempo: slow or medium.
Acoustic characteristics: (a) Wideband events of tabla hits at regular intervals are seen in the spectrogram. (b) Generally a change of pitch at the beats; long held pitches due to the slow tempo. Lyrics comprising vowels and consonants break up a continuously rendered note, with formant locations changing with the vowels. (c) The 'mukhada' melodic phrase is generally less variable in terms of the pitches used and can be identified to get the start of the cycle. (d) Evolution of pitch values over the alap section.
Features: (a) The total section of the Bada Khayal and its improvisation can be separated using a rhythmogram of tabla onsets. (b) The bandish section alone can be separated using the cue of the lyrics through the spectral centroid on the source-isolated spectrum, as it comes immediately before the alap with tabla and immediately after the alap without tabla. (c) 'Mukhada' identification [19] can be used. (d) A short-time histogram can show the note being emphasized in each time interval and the trend over time.

Chota Khayal
Musical characteristics: Tabla: (a) the tempo is fast (in comparison with the Bada Khayal). Vocal melody: (b) similar to the Bada Khayal, it is a composition with sthayi and antara. A tarana can also be taken, which uses syllables like ta, na, de, re and dim, or even tabla bols, instead of lyrics.
Acoustic characteristics: (a) The interval between tabla strokes is smaller than in the Bada Khayal. (b) Pitch is not held for long durations (relative to the Bada Khayal); increased pitch ornamentation compared to the Bada Khayal.
Features: (a) A rhythmogram of tabla onsets can separate the Chota Khayal from the Bada Khayal. (b) Spectral centroid and its delta, as consonants occur in quicker succession than in the Bada Khayal, where vowels are stretched over long notes.

Alap / Vistar (Akar / Bol)
Musical characteristics: Tabla: (a) slow tempo; more percussive fillers. Vocal melody: (b) slow elaboration of the raga through a sequence of melodic phrases with emphasis on resting notes. (c) The melody mainly remains in the 1st part of the middle octave. (d) Elaboration is done using either the vowel /a/ (akar) or the lyrics of the composition (bol), with emphasis on melodic patterns and variation of motifs.
Acoustic characteristics: (a) Wideband percussive-hit events visible at widely separated instances compared to the Chota Khayal; fillers present. (b) Constant pitch held for a long time. (c) The pitch remains in the middle octave. (d) Formants change as the lyrics are uttered.
Features: (a) A rhythmogram of tabla onsets cannot distinguish this section from the others in the Bada Khayal improvisation. (b) Steady-note measure to distinguish it from the baat and taan sections. (c) A short-time histogram can show the note being emphasized in each time interval and the trend over time. (d) Spectral centroid to distinguish bol from akar.

Baat / Layakari (Sargam / Bol)
Musical characteristics: Tabla: (a) may mimic the patterns of the singer or play normally, depending on the context. Vocal melody: (b) rhythmic improvisations using the names of the notes (sargam) or the lyrics of the composition (bol); stress is given at the beats by making note or lyric changes there. The speed of the bol/sargam can be the same as, twice, or any multiple of the tempo, but not as fast as in the taan; playing with notes and lyrics with emphasis on rhythm. (c) No long held notes.
Acoustic characteristics: (a) Consonant breaks at regular intervals in the spectrogram; note transitions at percussive hits. (b) Pitch modulation, if any, is moderate compared to the taan; no long held pitch.
Features: (a) A rhythmogram of vocal onsets will be useful after source isolation, as will the spectral centroid after source isolation. (b) Pitch modulation captured via the energy ratio over the frequency ranges corresponding to the oscillations here.

Taan (Akar / Bol / Sargam)
Musical characteristics: Tabla: (a) the tempo might get faster, with the basic tala being played without fillers. Vocal melody: (b) rapid gamak-like movements taken using the vowel /a/ (akar), the lyrics of the composition (bol) or the names of the notes (sargam). (c) The patterns range over all three octaves, and the section serves as the climax of the raga presentation, with emphasis on vocal skills. This is the most distinctive section of the Khayal rendition.
Acoustic characteristics: (a) Wideband events of percussive hits at regular but shorter intervals (with no fillers). (b) Rapid pitch oscillations and rapid energy fluctuations. For akar, the formants corresponding to the vowel /a/ appear dark; for bol the formants change gradually, while for sargam they change rapidly.
Features: (a) The rhythmogram cannot distinguish this section from the others. (b) Frequency of the oscillatory pitch; rate of zero crossings of the mean-subtracted energy contour; spectral centroid for formant-change detection in bol and sargam.
Chapter 5
Pre-processing
Before extracting the features, we need to extract the pitch corresponding to the vocal melody. We approach this via slight modifications to the state-of-the-art singing voice detection (SVD) algorithm [11]. Another required step is obtaining finer ground truth for calculating the frame-wise accuracy of taan detection, as described in Section 4.1.1. Each task is described in detail below.
5.1 Singing Voice Detection (SVD)
Taan features are derived from the melody alone and rely on the accuracy of the vocal melody extraction. An important step in extracting the melody is the detection of the regions where the singing voice is present, so that the melody is extracted only in those regions, as explained in Section 3.1.1. Data Subset A has been annotated for vocal and instrumental regions for SVD evaluation. The total duration of the annotated data is 17 hrs, of which 12 hrs are vocal regions. Any pause the artist takes for breath is ignored and included in the vocal region; breath pauses are those of perceptually insignificant duration (roughly less than 80 ms), as referred to in [23]. The output of the SVD algorithm can be compared with this annotated ground truth to obtain the accuracy.
The pitch of the dominant source is first extracted using the technique of [4]. The assumption is that the vocal melody, being the lead, has the strongest contribution to the spectrum; thus, by extracting the predominant F0, we expect to obtain the melody corresponding to the voice wherever both voice and accompaniment are present. The F0 is further used to isolate the dominant-source spectrum by reliably extracting sinusoidal partials using a main-lobe matching technique. Line spectra are obtained by searching in the vicinity of multiples of the detected pitch.
Features are extracted on the source-isolated spectral envelopes, as these are less dependent on the pitch of the source and better represent the source timbre. The static timbral, dynamic timbral and dynamic F0-harmonic feature categories were proposed in [11], where experiments used just 13 min of Hindustani classical audio and a supervised GMM in leave-one-song-out validation with four mixtures per class, giving an accuracy of 84%. In this work,
we combine the feature categories and apply feature selection using the WEKA toolbox [20] (in [11], feature selection was applied within each feature category and supervised classification was carried out). We obtain 11 selected features from among all the categories (85 features in total) using the Best First search method in WEKA. We explore the possibility of using an unsupervised classification algorithm, as the acoustic characteristics are clearly different for voice and instrument regions within an audio. A particular audio will always contain the tabla for rhythmic accompaniment and the tanpura as the tonic reference, but the melodic accompaniment might be the sarangi or the harmonium. An artist might also choose to use a swaramandal for playing different reference notes, and in rare cases, for the Chota Khayal section, both the tabla and the mridangam may be used for rhythmic accompaniment. Due to this variability, it is preferable to make a within-clip decision for the vocal and instrumental marking.
An unsupervised classification algorithm, k-means clustering, is applied, and frames are labeled as belonging to either the vocal or the instrumental regions using one of the dominant features to make the decision. From here, the pitch corresponding to the vocal regions alone can be extracted and passed to the feature extraction block. Singing voice detection using k-means clustering and genre-specific features was seen to give 88.30% accuracy when evaluated on data Subset A. Further, the SVD accuracy within the taan regions was also evaluated, to ensure that no SVD-related errors affect taan episode detection; the frame-wise accuracy within the taan regions came out to be 89.20% for data Subset A. The labeling of the clusters obtained after k-means is a key step: it is done using the feature with the highest individual classification accuracy, namely the normalized harmonic energy.
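The per-concert clustering and cluster-labeling step can be sketched as follows; a minimal two-cluster k-means (initialized at the extremes of the first feature) stands in for a library implementation, and the feature values are synthetic:

```python
import numpy as np

def svd_kmeans(features, harmonic_energy, n_iter=50):
    """Within-clip 2-class k-means with cluster labeling (sketch).

    `features` is an (n_frames, n_features) array of selected SVD
    features.  Frames are split into two clusters, and the cluster with
    the higher mean normalized harmonic energy is labeled 'V' (vocal),
    the other 'I' (instrumental).
    """
    # deterministic initialization: the two extremes of the first feature
    idx = [int(features[:, 0].argmin()), int(features[:, 0].argmax())]
    centers = features[idx].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = features[assign == k].mean(axis=0)
    # label the cluster with higher mean harmonic energy as vocal
    vocal = int(harmonic_energy[assign == 1].mean() >
                harmonic_energy[assign == 0].mean())
    return np.where(assign == vocal, 'V', 'I')
```

Running the clustering per clip, as argued above, keeps the decision independent of which accompanying instruments happen to be present.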
Figure 5.1: Algorithm-marked Vocal (V) and Instrumental (I) boundaries for the audio concert with the least accuracy. The highlighted "I" marking should have been "V".
The accuracy, obtained by comparing the frame-wise labels, came out to be 88.30%, which makes our SVD algorithm reliable for the subsequent extraction of pitch in vocal regions alone. Analysis of the two clips where the accuracy was below 70% shows that the spectrograms were washed out, i.e. the vocal harmonics were not clear, and in a few instances the background accompaniment was loud. As can be seen in Figure 5.1, the highlighted region does not look clear even though the voice of the lead vocal artist is present and audible in that region of the audio. Tanpura suppression might help in cases where a loud tanpura affects the SVD decision, since long steady notes get marked as belonging to the instrumental category.
Though the accuracy of SVD was seen to be high, we inspected all the taan regions in the total database of 102 concerts. Six clips in particular had problems in the SVD decision within taan regions, apart from the two error clips analyzed in data Subset A. These 6 concerts, which were eliminated from the 102 concerts to create data Subset D for taan detection, contained background vocals as well as the accompanying instrument sarangi, which has a frequency range similar to the human voice and, unlike the harmonium, is played in a continuous fashion. This causes confusion in the vocal and instrumental decisions made by the SVD algorithm. Figure 5.2 illustrates one such case. Though the sarangi is played at a higher pitch than the voice, the SVD cannot give more weight to the pitch feature, as all the features are weighted equally.
Figure 5.2: Spectrogram where, along with the lead artist, the accompanying instrument sarangi is present; it has a frequency range similar to the human voice
The frame-level accuracy of SVD was also calculated using supervised GMM classification in leave-one-song-out validation, as proposed in [11]. Four Gaussians were used to model each class (vocal and instrumental), and the accuracy was found to be 80.10%, which is less than that obtained using k-means clustering (88.30%).
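The leave-one-concert-out evaluation can be sketched as below; to keep the sketch dependency-free, a single diagonal Gaussian per class stands in for the 4-mixture GMM of [11], and the data, dimensions and class sizes are synthetic:

```python
import numpy as np

def loo_gaussian_svd(concert_feats, concert_labels):
    """Leave-one-concert-out classification (simplified sketch).

    For each held-out concert, one diagonal Gaussian per class ('V'/'I')
    is fitted on the remaining concerts, frames of the held-out concert
    are classified by higher log-likelihood, and the frame-wise
    accuracies are averaged.
    """
    def log_gauss(x, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

    accs = []
    for i in range(len(concert_feats)):
        train_x = np.vstack([f for j, f in enumerate(concert_feats) if j != i])
        train_y = np.concatenate([l for j, l in enumerate(concert_labels) if j != i])
        test_x, test_y = concert_feats[i], concert_labels[i]
        scores = []
        for c in ('V', 'I'):
            cls = train_x[train_y == c]
            scores.append(log_gauss(test_x, cls.mean(0), cls.var(0) + 1e-6))
        pred = np.where(scores[0] > scores[1], 'V', 'I')
        accs.append(np.mean(pred == test_y))
    return float(np.mean(accs))
```

The same loop structure applies unchanged when a multi-mixture GMM replaces the single Gaussian.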
5.2 Obtaining Finer Ground Truth within Taan Section (SAD Method)
We observed that the musicians labeled taans based on the perceived intent of the performer, i.e. relatively short durations of instrumentals and other vocal styles occurring between taan episodes were subsumed by the taan label (as in Figure 5.3). For the real-world use case, we would like our automatic system to match the musicians' labeling of the taan sections in the concert. At the same time, for the initial evaluation, we do not want the classifier accuracy to be affected by the non-taan regions occurring within the taan section, as seen in Figure 5.3. To facilitate this, finer frame-level markings within the taan section were obtained automatically using the Speech Activity Detector (SAD). The SAD method is completely unsupervised and is based on the speaker diarization system of [14].
The SAD was provided with our carefully derived pitch- and energy-based features. It is an iterative classification process in which separate Gaussian Mixture Models (GMMs) are fitted to the frames classified as speech and non-speech (in our case, taan and non-taan). Classification
is performed using these models on all the frames again. The process repeats and it stops when
there is convergence or till
Chapter 6
Taan Section Detection
The taan section is important to identify, as it signals the end of a Bada Khayal or Chota Khayal improvisation part. The typical characteristic of the taan section is rapid oscillatory pitch fluctuation, as seen in Figure 6.1. Further information about the types of taans and the need for pitch-based features for their detection is given in Table 4.3. The taan is similar to vibrato in Western music, but with a slightly different rate of frequency modulation and with oscillations that are not limited to a single note. These oscillations occur irrespective of whether the taan is rendered in the Bada Khayal improvisation section or in the Chota Khayal. While the acoustic characteristics of other sections may change due to the gradual increase in tempo over the duration of the concert, the taan rate typically lies in the range of 5-10 Hz across artists, irrespective of the underlying tempo changes.
A spectrogram of a typical akar taan can be seen in Figure 6.1, where the x-axis is time and the y-axis is frequency in Hz. Across various concerts, the pitch oscillation frequency was observed to lie between 5 and 10 Hz. As per the conclusions of the studies in [21], in a pedagogical scenario the rate of taan oscillations ranges from 1.65 to 3.14 Hz, but we are interested in the actual performance scenario, where this rate is considerably higher. The vocal melody extracted from a concert shows the rapid pitch oscillations in the taan section, which starts after 1196 sec in Figure 6.2(a), as compared to the steady notes and slow ornamentation before 1196 sec. As can be seen in Figure 6.2(a), around 6 oscillations occur in the one-second interval from 1198 sec to 1199 sec.
Figure 6.1: Spectrogram of a part (7 sec) of a longer akar taan section in raga Madhukauns, performed by artist Jagdish Prasad. The oscillatory pitch harmonics in the vocal melody are seen, with the darker harmonics corresponding to the vowel /a/. The beat tier indicates the tabla hits, which are visible as vertical lines in the spectrogram.
Figure 6.2: Performance by Kumar Mardur of raga Shree, where the taan section begins at 1200 sec, with (a) pitch variations and (b) energy variations corresponding to only the voice.
Figure 6.3: Spectrogram of a part of a sargam taan section in raga Madhukauns, performed by artist Jagdish Prasad. The oscillatory pitch harmonics in the vocal melody are seen, with the darker harmonics corresponding to the phones of note names like Pa, Ni, Sa, Ga, etc. The beat tier indicates the tabla hits, which are visible as vertical lines in the spectrogram.
Another characteristic is that the energy in the taan section fluctuates rapidly compared to that in other sections, as seen in Figure 6.2(b).

While rendering an akar taan, the artist generally sticks to the vowel /a/, and the formant locations corresponding to it can be seen as dark lines in Figure 6.1. When we observe the spectrogram of a sargam taan in Figure 6.3, considerable formant movement can be seen as the singer utters the swaras (solfege) while singing the oscillatory melodic movements. Many 'breaks' can thus be seen in the melody line, corresponding to the consonants being uttered. The bol taan section in Figure 6.4 also shows formant movement, but not as rapid as in the sargam taan, because the artist utters consonants at larger time intervals than in the sargam taan and holds the vowels of the lyrics for longer durations.
Figure 6.4: Spectrogram of a part of a bol taan section in raga Madhukauns, performed by artist Jagdish Prasad. The oscillatory pitch harmonics in the vocal melody are seen, with the darker harmonics corresponding to the phones of the lyrics 'Tuma bina kaun'. The beat tier indicates the tabla hits, which are visible as vertical lines in the spectrogram.
6.1 Pre-processing / Melody Extraction
Vocal pitch is the principal cue for distinguishing the taan section; therefore, we need to extract it reliably in order to compute features on it. The SVD algorithm accuracy is 88.30%, as seen in section 4.1.1(i), which makes its vocal/instrumental decisions quite reliable for Hindustani music. Pitch extraction on the vocal regions marked by the SVD algorithm is carried out using the technique of [4], henceforth referred to as PolyPDA in this thesis.
6.2 Feature Extraction
We aim to capture the acoustic characteristics of the taan section as identified by observing spectrograms of a number of clips. As per our observations in Figure 6.2, both the pitch oscillations and the variations in the energy contour can be exploited to detect the taan sections: regular pitch oscillations appear in the taan section as opposed to the non-taan sections, and the fluctuations in the energy contour are also high in the taan section. The pitch values are computed at 10 ms intervals after pitch extraction of only the voice from the polyphonic audio. Features are first calculated on short analysis frames and then averaged over longer texture windows. This helps to eliminate spurious taan-like movements that might occur in other sections and to obtain the average overall behaviour. For our study, we considered a texture window of 5 sec, slightly less than the minimum duration of a taan section, with a texture-frame hop of 1 sec. The analysis frames are of 1 sec with a 0.5 sec hop, also seen to be effective in [22]. As can be seen from Figure 6.5, averaging over texture windows is essential to avoid spurious feature values when a taan-like movement occurs in a non-taan section. The pitch contour is first interpolated across silences shorter than 80 ms to avoid breaks in the pitch due to consonants and breath pauses [23]. The feature values are normalized to zero mean and unit variance across the concert. Even from the unnormalized feature values in Figure 6.5, the distinction between taan and non-taan regions is quite clear.
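The pre-processing steps above (bridging short unvoiced gaps, then averaging frame-level features over texture windows and z-normalizing) can be sketched as below. This is an illustrative sketch, not the thesis code; the function names and the use of numpy are assumptions, and `pitch == 0` is assumed to mark unvoiced frames.

```python
import numpy as np

def interpolate_short_gaps(pitch, hop_ms=10, max_gap_ms=80):
    """Linearly bridge unvoiced gaps (pitch == 0) shorter than
    max_gap_ms, so consonants and breath pauses do not break the
    contour; longer gaps are left as silence."""
    p = np.asarray(pitch, float).copy()
    n = len(p)
    i = 0
    while i < n:
        if p[i] == 0:
            j = i
            while j < n and p[j] == 0:
                j += 1
            gap_ms = (j - i) * hop_ms
            if 0 < i and j < n and gap_ms < max_gap_ms:
                # interpolate between the voiced samples flanking the gap
                p[i:j] = np.linspace(p[i - 1], p[j], j - i + 2)[1:-1]
            i = j
        else:
            i += 1
    return p

def texture_average(frame_feats, frame_hop_s=0.5, win_s=5.0, hop_s=1.0):
    """Average 1 s analysis-frame features (0.5 s hop) over 5 s
    texture windows with 1 s hop, then z-normalize across the
    whole sequence, as done per concert."""
    per_win = int(win_s / frame_hop_s)
    step = int(hop_s / frame_hop_s)
    out = np.array([frame_feats[i:i + per_win].mean()
                    for i in range(0, len(frame_feats) - per_win + 1, step)])
    return (out - out.mean()) / (out.std() + 1e-12)
```

The 80 ms threshold keeps genuine inter-phrase silences intact while smoothing over consonant breaks, so the DFT of the pitch contour is not dominated by discontinuities.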
Figure 6.5: Energy Fluctuation Rate feature values over an entire concert of raga Shree performed by artist Kumar Mardur, with the ground truth of the taan sections shown as red boxes. Features are plotted with (a) analysis window length 1 sec and hop 0.5 sec (no texture window applied) and (b) texture window length 5 sec and hop 1 sec. The distinction of the taan section feature values is more evident in (b) than in (a).
6.2.1 Pitch Based Features
The DFT spectrum (128-point) of 1 sec segments of the pitch contour (1 sec = 100 pitch values at 10 ms sampling) is computed using a sliding analysis frame of 1 sec with a hop size of 500 ms. From this DFT, two features are extracted: Energy Around Maximum Amplitude in DFT (EAMA) and Frequency Corresponding to Maximum Amplitude in DFT (FCMA). For the EAMA feature, 2 bins before and after the maximum-amplitude bin of the DFT are included (a span of about 3.9 Hz) in order to overcome the bin-resolution limitation. FCMA and EAMA are given by eq. 6.2 and eq. 6.3 respectively, where the Maximum Amplitude Value in DFT (MAV) of an analysis frame is given by eq. 6.1:
MAV = max_k |Z(k)|^2    (6.1)

FCMA = f_MAV    (6.2)

EAMA = \sum_{k = k_MAV - 2}^{k_MAV + 2} |Z(k)|^2    (6.3)

where Z(k) is the DFT of the mean-subtracted pitch trajectory z(n) with samples at 10 ms intervals, f_MAV is the frequency (in Hz) at which the maximum of |Z(k)|^2 occurs, and k_MAV is the frequency bin closest to f_MAV.
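A minimal sketch of computing FCMA and EAMA for one 1 sec pitch frame, following eqs. 6.1-6.3, is given below. The function name is an assumption; a one-sided FFT is used for convenience (the squared-magnitude spectrum it yields is the same as the corresponding DFT bins), and the DC bin is skipped since the frame is mean-subtracted.

```python
import numpy as np

def pitch_dft_features(pitch_frame, fs=100.0, nfft=128):
    """FCMA and EAMA from a 1 s pitch frame (100 samples at 10 ms):
    mean-subtract, take a 128-point DFT, find the dominant
    oscillation bin (MAV, eq. 6.1), report its frequency (FCMA,
    eq. 6.2) and the energy in +/-2 surrounding bins (EAMA, eq. 6.3)."""
    z = np.asarray(pitch_frame, float) - np.mean(pitch_frame)
    Z = np.fft.rfft(z, nfft)                 # one-sided spectrum
    mag2 = np.abs(Z) ** 2
    k_mav = int(np.argmax(mag2[1:])) + 1     # skip the DC bin
    fcma = k_mav * fs / nfft                 # frequency of the peak bin
    lo, hi = max(k_mav - 2, 0), min(k_mav + 2, len(mag2) - 1)
    eama = float(mag2[lo:hi + 1].sum())      # ~3.9 Hz span at this resolution
    return fcma, eama
```

With fs = 100 Hz and nfft = 128, the bin spacing is 100/128 ≈ 0.78 Hz, so the five bins summed for EAMA span roughly 3.9 Hz, matching the text.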
Figure 6.6: (a) Taan frame values for the 2 features, plotted as black circles for Bada Khayal and blue crosses for Chota Khayal; (b) Gaussians plotted using the mean and variance of the taan features in Bada Khayal and Chota Khayal to visualize their overlap.
6.2.2 Energy Fluctuation Rate
The energy values corresponding to only the vocal melody are also available at 10 ms intervals along with the pitch. To capture the variations in the energy contour seen in Figure 6.2(b), the mean value is first subtracted from the energy contour in a 1 sec analysis frame, and the number of zero crossings is then used as the feature value for that frame. Here as well, we use a hop of 500 ms for the 1 sec analysis frame and average these values over 5 sec texture frames with a 1 sec hop.
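The per-frame computation reduces to a zero-crossing count of the mean-subtracted energy contour, as in the sketch below (the function name is an assumption; texture averaging would then be applied exactly as for the pitch features).

```python
import numpy as np

def energy_fluctuation_rate(energy_frame):
    """Zero-crossing count of the mean-subtracted energy contour of
    one 1 s analysis frame (100 values at 10 ms); a high count
    reflects the rapid energy fluctuations typical of taan singing."""
    e = np.asarray(energy_frame, float)
    e = e - e.mean()
    # count sign changes between consecutive samples
    return int(np.sum(np.signbit(e[:-1]) != np.signbit(e[1:])))
```

A steady energy contour yields a count near zero, while energy pulsing at the taan rate of 5-10 Hz yields on the order of 10-20 crossings per second.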
6.3 Inspection of Variability of Features with Tempo
As described in section 4.1.1(iii), the tempo gradually increases over the duration of the concert. We want to verify that the proposed features are invariant to this tempo variation, and we use data Subset A for this purpose. Generally an abrupt change in tempo is seen at the start of the Chota Khayal section, so the timing of this starting point can be used to split the data. Figure 6.6(a) shows the feature values plotted across data Subset A, with high overlap between the taan feature values in the Bada Khayal and Chota Khayal sections. For the scatter plot over the 2 features, the mean values of the Bada Khayal and Chota Khayal taan features are (-0.0302, -0.0458) and (0.0761, 0.1151) respectively, while the variances are (0.0325, 0.0291) and (0.0507, 0.0487) respectively. These feature values are close, and their overlap can be visualized in Figure 6.6(b) with the help of Gaussian contours plotted using the above means and variances.

The Euclidean distance between the Bada Khayal and Chota Khayal taan features was calculated, as were the Euclidean distances among taan features within the Bada Khayal and within the Chota Khayal sections. A histogram of these distance values is plotted in Figure 6.7. The mean (0.6) of the distances between Bada Khayal and Chota Khayal taan feature vectors is comparable with the means of the within-Bada-Khayal and within-Chota-Khayal distances (0.49 and 0.59 respectively). Thus, both quantitatively and from the figure, it is evident that
Figure 6.7: Histogram of Euclidean distances in feature values: the distances between the lower-tempo Bada Khayal section and the higher-tempo Chota Khayal section are comparable with the distances within the Bada Khayal and within the Chota Khayal.
the feature values are tempo independent.
6.4 Classification and Grouping using Posteriors
A frame-wise classification into taan and non-taan styles is carried out for all frames in the vocal segments by a trained MLP network. We use a feed-forward architecture with a sigmoid activation function for the hidden layer comprising 300 neurons. Training uses cross-entropy error minimization via the error back-propagation algorithm. We compare the frame-level accuracy of the MLP with an SVM classifier on data Subset B using leave-one-song-out validation. As seen in Table 6.1, the MLP performs better than the SVM, so we choose to proceed with the MLP. We also report the accuracy of a Deep Belief Network (DBN) with varying numbers of hidden layers of 300 neurons each: with 2, 3 and 4 hidden layers, the accuracies were 93.54%, 93.29% and 93.45% respectively. The number of neurons was also varied from 100 to 1000 in steps of 50, with no significant change in accuracy. Since the data set is not large, we decided to use fewer neurons and just 1 hidden layer. The precision and recall for taan and non-taan were similar to the MLP results. Upon classification, the recall and precision of taan frame detection with respect to the ground truth serve to measure the discriminative power of the features. In our case, however, we seek to label continuous regions of the audio rendered in taan style, much as a human annotator would. This requires grouping frames based on homogeneity with respect to the taan characteristics. Novelty detection based on a self-distance matrix (SDM) is an effective way to find segment boundaries [6].
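The frame-level classifier that produces the posteriors used below can be sketched with scikit-learn; this is an assumption, as the thesis does not name a toolkit, and scikit-learn's default optimizer stands in for plain back-propagation (it still minimizes cross-entropy, matching the description: one hidden layer of 300 sigmoid units).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_frame_classifier(X_train, y_train):
    """Feed-forward MLP as described in the text: one hidden layer
    of 300 logistic (sigmoid) units, trained by minimizing
    cross-entropy (log-loss)."""
    clf = MLPClassifier(hidden_layer_sizes=(300,),
                        activation='logistic',
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    return clf

def frame_posteriors(clf, X):
    """Per-frame posterior vectors [P(non-taan), P(taan)] that
    later feed the self-distance matrix."""
    return clf.predict_proba(X)
```

The posterior vectors, rather than hard labels, are retained here because the subsequent grouping stage builds its self-distance matrix from them.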
We use a recently proposed approach of computing the SDM from the posterior probabilities derived from the features, rather than from the features themselves [1]. Using posterior probabilities for the SDM computation yields enhanced homogeneity due to reduced sensitivity to irrelevant local variations. The Euclidean distance between vectors of posteriors is used for calculating the SDM, where the posteriors are the class-conditional probabilities obtained from the MLP classifier for each test input frame. Points of high contrast in the
Table 6.1: Comparison of frame-wise accuracies of SVM and MLP, with precision and recall values for each class

              SVM (accuracy 91.74%)     MLP (accuracy 93.58%)
  Class       Precision    Recall       Precision    Recall
  taan        0.7018       0.8928       0.7962       0.8216
  non-taan    0.9768       0.9224       0.9586       0.9647
SDM are detected by convolution along the diagonal with a checkerboard kernel [12], whose dimensions depend upon the desired time scale of segmentation. Considering the minimum taan episode duration, this is chosen to be 5 sec in the interest of obtaining reliable boundaries, at the possible cost of some false negatives. The resulting novelty function is searched for peaks, representing segment boundaries, using local peaks within a local neighborhood [7]. Whether a region between two detected boundaries corresponds to a taan is determined by the majority of the frame-level classifications in that region. Finally, the highest level of grouping is obtained by examining the region of audio separating every two detected taan segments. A simple heuristic is set up to mimic the musicians' annotation, as discussed in section 4.3: taan episodes separated by non-taan vocal activity of duration within 10 sec are merged into a single section, and the merging is also applied if the separation corresponds to a purely instrumental region of duration within 50 sec.
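The SDM and checkerboard-kernel novelty computation can be sketched as below. This is a simplified illustration: the function names are assumptions, the half-width of 5 frames assumes the 1 sec texture hop (so the kernel spans roughly the 5 sec scale named above), and the peak picker is a bare local-maximum stand-in for the neighborhood-based selection of [7]. Since the SDM holds distances (not similarities), cross-quadrant entries are weighted +1 and same-quadrant entries -1, so the score peaks where two homogeneous regions meet.

```python
import numpy as np

def posterior_sdm(posteriors):
    """Euclidean self-distance matrix over per-frame posterior vectors."""
    diff = posteriors[:, None, :] - posteriors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def checkerboard_novelty(sdm, half_width=5):
    """Slide a 2*half_width checkerboard kernel along the SDM
    diagonal and sum the weighted entries at each position."""
    v = np.concatenate([-np.ones(half_width), np.ones(half_width)])
    kernel = -np.outer(v, v)      # +1 cross-quadrant, -1 same-quadrant
    n = sdm.shape[0]
    nov = np.zeros(n)
    for t in range(half_width, n - half_width):
        nov[t] = np.sum(kernel * sdm[t - half_width:t + half_width,
                                     t - half_width:t + half_width])
    return nov

def pick_peaks(nov):
    """Local maxima of the novelty score, taken as segment boundaries."""
    return [t for t in range(1, len(nov) - 1)
            if nov[t] > nov[t - 1] and nov[t] >= nov[t + 1]]
```

On a sequence that switches from confident non-taan posteriors to confident taan posteriors, the novelty score peaks exactly at the switch frame.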
The intermediate step of frame-wise classification works well using the MLP, as can be seen from the accuracy of taan vs. non-taan classification. The evaluation of the grouping is described in the next chapter.
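The final merging heuristic of section 6.4 (merge across short non-taan vocal gaps within 10 sec, or purely instrumental gaps within 50 sec) can be sketched as follows; the function name and the `gap_is_instrumental` callback are assumptions introduced for illustration.

```python
def merge_taan_episodes(episodes, gap_is_instrumental,
                        vocal_gap_max=10.0, instr_gap_max=50.0):
    """episodes: sorted list of (start, end) times in seconds.
    gap_is_instrumental(a, b) -> True if the interval (a, b)
    contains no vocals. Mimics the musicians' annotation heuristic:
    merge across non-taan vocal gaps within 10 s, or purely
    instrumental gaps within 50 s."""
    if not episodes:
        return []
    merged = [list(episodes[0])]
    for start, end in episodes[1:]:
        gap = start - merged[-1][1]
        limit = (instr_gap_max if gap_is_instrumental(merged[-1][1], start)
                 else vocal_gap_max)
        if gap <= limit:
            merged[-1][1] = end          # extend the current section
        else:
            merged.append([start, end])  # start a new section
    return [tuple(m) for m in merged]
```

The callback abstracts away the vocal/instrumental decision, which in the actual system comes from the SVD labels.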
Chapter 7
Experiments and Evaluation
Our ideal system would detect and segment taan sections similar to a musician's labeling. This high-level task is attempted by the sequence of frame-level automatic classification and higher-level grouping described in section 6.4. We perform 2 types of experiments: first, with the smaller artist-specific data Subset B, to decide the best operating point for the conversion of MLP posteriors into labels; second, evaluation on data Subset D using the parameters decided on data Subset B.

We present experimental results on the performance of each of the modules. Frame-level classification is measured by the detection of taan in terms of recall and precision. Artist-dependent and artist-independent training are compared within the 22-concert database. The frame-level classification needs frame-level (i.e., 1 sec resolution) annotation of taan presence or absence, both for training the classifiers and for reliable testing. The musician labels are not directly usable for this purpose due to the presence of non-taan interruptions of significant duration within the musician-labeled taan sections, as seen before. Thus, for the development of the frame-level classifier, we need a finer marking of taan segments. Since this is a demanding task to carry out manually, we use the bootstrapped iterative SAD approach described in section 4.1.1(ii). The SAD evaluation showed that the frame-level labels so obtained were indeed accurate, and these were then used to train and evaluate the frame-level classifiers. The system is also evaluated after grouping, this time in terms of the match between the detected segments and the subjectively labeled taan segments for each concert. We tabulate the taan episode detection by reporting the numbers of over-segmentations, under-segmentations, exact detections, false negatives and false positives.
Conventional measures of performance include cluster purity, pairwise hamming distance, boundary precision and recall, etc., as detailed in [24]. Any two carefully selected measures give an idea of the type of segmentation achieved; but since our task is one of detection, we must also report the number of detections along with these measures. Cluster purity and boundary retrieval were used by [1] to report their segmentation evaluation. We also report the cluster purity and boundary retrieval values, as per these standard evaluation measures, in Table 7.1 for completeness and to emphasize that they are not enough to give a full picture of taan section detection. The high cluster purity values (close to 1 is good) show that the detected taan and non-taan sections are homogeneous even though there is over- and under-segmentation; however, we do not get an idea of the number of detected taan sections. The same holds for the reported boundary retrieval values: they reflect the under- and over-segmentation, but their number is not conveyed.

Figure 7.1: Various scenarios that occur after grouping, viz. (a) false positive, (b) over-segmentation, (c) exact detection, (d) false negative, (e) under-segmentation
Table 7.1: Performance evaluation using conventional measures on (a) the proposed system and (b) the GMM-based system of [1], after grouping at frame level

                   Boundary Retrieval       Cluster Purity
  Method           Precision    Recall      acp      asp      k
  (a) MLP based    0.448        0.578       0.768    0.779    0.773
  (b) GMM based    0.2187       0.4062      0.763    0.621    0.6869
To the best of our knowledge, however, no previous attempt has been made at using supervised classification for labeling the segments, especially for the taan section detection problem in Hindustani vocal concerts. It is not possible to compare our results with any method other than [1], which works on Dhrupad and instrumental concerts; there, the segments are not labeled, whereas we have modified the approach to perform labeling in the case of taan. Moreover, the conventional measures of performance cannot convey how many of the sections marked in the ground truth have been correctly labeled and retrieved. Along with the frame-level accuracy, we therefore measure performance by including the number of correctly retrieved taan segments and the number of false positives. A section is said to be correctly retrieved if there is an overlap of at least 50% of its duration with a detected segment. Also of interest is the extent of over- or under-segmentation of the correctly detected taan sections. Figure 7.1 illustrates the different possibilities of mismatch observed between the subjective labels and the automatically labeled sections. When a subjectively labeled section is correctly detected, the onset and offset boundaries are observed to be always within 5 sec of the corresponding ground-truth boundaries, indicating the reliability of the posteriors-based detection of sections.
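The 50% overlap criterion for a correctly retrieved section can be stated precisely in a few lines; the function name below is an assumption introduced for illustration.

```python
def is_retrieved(gt, detections):
    """gt: (start, end) of a ground-truth taan section; detections:
    list of (start, end) detected segments. The section counts as
    correctly retrieved if some detection overlaps at least 50% of
    the ground-truth duration."""
    g0, g1 = gt
    dur = g1 - g0
    for d0, d1 in detections:
        overlap = max(0.0, min(g1, d1) - max(g0, d0))
        if overlap >= 0.5 * dur:
            return True
    return False
```

Note that the criterion is one-sided: only the fraction of the ground-truth section that is covered matters, so a long detection spanning a short ground-truth section still counts as a retrieval.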
7.1 Evaluation on data Subset B
Our audio data Subset B consists of 57 Khayal vocal concert recordings partitioned into two distinct sets: 22 single-artist (Pt. Jasraj) concerts and 35 multi-artist concerts (that do not contain Pt. Jasraj). In both cases a number of different ragas are covered at various tempi, and all artists are male. The 22-concert set is treated as the test set with two different training conditions: artist-specific training via leave-one-song-out cross-validation

Figure 7.2: (a) ROC obtained by thresholding the posterior values from the MLP, for the leave-one-song-out case of 22 concerts and for the 35-train / 22-test concert scenario, and (b) SDM + novelty + grouping for the 35-train / 22-test scenario using MLP posteriors; the pink boxes indicate the musician-marked ground truth, the red stars the MLP labels, the black continuous contour the novelty score, the green filled boxes the detected taan regions before grouping, and the circled peaks connected by blue lines the final taan episodes after grouping.