
Audio-based Bird Species Classification

Final Report

SGN-81006 Signal Processing Innovation Project

Version 1.1, June 2015

Shriram Nandakumar

Student Number: 244935

[email protected]

Client: Tuomas Virtanen, Department of Signal Processing, Tampere University of Technology.

Expected Credits: 7

Client: _____________________________ Tuomas Virtanen

Course Responsible: __________________________ Sari Peltonen

Page 2: Audio based Bird Species Recognition

Version history

SL No.   Date         Version   Person      Description
01       14.05.2015   1.0       Shriram N   Preliminary version
02       01.06.2015   1.1       Shriram N   Final version

Page 3: Audio based Bird Species Recognition

Abstract

This document presents the details of a project on hierarchical classification of bird species using audio information. The project has immediate applications in biodiversity conservation and is also of interest to the machine learning community. The preliminary tasks included understanding the data and its taxonomy, followed by retrieval and organization of the recordings. Tools were then developed to visualize the hierarchical class structure. Frame-level MFCC features were extracted from the raw audio files and used to train multi-class classifiers at the leaf level of the taxonomy. Naïve Bayes and k-Nearest Neighbour classifiers were investigated. Owing to the challenging nature of the database and the wide flat-class structure, the recognition rates were poor. As a next step, hierarchical classification with local classifiers was therefore attempted, specifically the Local Classifier per Parent Node (LCPN) approach. Performance was assessed using hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-score (hF). No significant improvement was obtained, which opens up opportunities for further investigation.

Page 4: Audio based Bird Species Recognition

CONTENTS

1. Introduction

2. Summary of the project
   2.1 Project Organization
   2.2 Project Objectives
   2.3 Project Resources

3. Project Implementation Details
   3.1 Database & Taxonomy
   3.2 Class Labelling
   3.3 Feature Extraction
   3.4 Training & Test Data Preparation
   3.5 Classifiers Used
   3.6 Hierarchical Classification
       3.6.1 Flat Classification
       3.6.2 Local Classifier per Parent Node Approach
       3.6.3 Performance Measures

4. Project Realization
   4.1 Workload Division
   4.2 Meetings with the Client
   4.3 Problems, Delays and Changes
   4.4 Budget
   4.5 Lessons Learnt

5. Project Results and Conclusions

6. Comments on the Course

7. References

Page 5: Audio based Bird Species Recognition

List of Abbreviations

DAG Directed Acyclic Graph

LCPN Local Classifier per Parent Node

MFCC Mel-Frequency Cepstral Coefficients

NB Naïve Bayes

𝑘-NN 𝑘-Nearest Neighbours


1. Introduction

Automatic classification and recognition of bird species by their acoustic cues has been a

subject of interest to ornithologists, ecologists, biodiversity conservationists and pattern

detection researchers for many years [1]. Birds have been used widely as indicators of

biodiversity because they provide critical ecosystem services, respond rapidly to change, are

relatively easy to detect, and may reflect changes at lower trophic levels (e.g., insects, plants)

[2]. Hence the immediate and often-cited application of this project is automatic bird population surveys.

In many application fields, taxonomies and hierarchies are natural ways to organize and

classify objects. Machine learning research, however, has largely focused on flat target prediction, where the output is a single binary or multi-valued scalar variable [3]. The natural taxonomical structure of bird species therefore offers ample scope for machine learning research. Most of the work in this area has concentrated on signal representation, noise removal, feature extraction and flat target classification; hierarchical classification has seldom been attempted [4].

2. Summary of the project

2.1 Project Organization

The project, in all its stages, was done by a single person. The client was Tuomas Virtanen

([email protected]) representing the Audio Research Team (http://arg.cs.tut.fi) at

Tampere University of Technology.

2.2 Project Objectives

The client gave the author several objectives. First, a method had to be found to retrieve a subset of the database from Xeno-Canto [5] and organize the recordings. The author also had to develop tools to visualize the hierarchy. Suitable feature extraction and classification tools then had to be used to build an audio-based bird species recognition system. The emphasis was on implementing hierarchical classification rather than on maximizing performance. Finally, the evaluation results had to be reported.

The personal goals of the author were to apply the signal processing and machine learning methods he had learned and to gain hands-on experience. Being more interested in the underlying mathematics of the whole process, the author also wanted to investigate the problem within a more mathematical framework.

2.3 Project resources

This section lists the project resources that were needed to carry out the project:

1. Audio data from Xeno-Canto [5], which the author had to find ways to obtain.

2. A personal computer with adequate computing resources (2+ GHz CPU, 2 GB RAM) and ample disk space (at least 20 GB). The author used his own laptop computer.

3. MATLAB software.

4. Openly available tools for feature extraction (MFCC calculation) and classification [6].

3. Project Implementation Details

This chapter discusses the implementation of the developed audio-based bird species

classification system. It starts with the general description of the retrieved data and the class

taxonomy details. This is followed by a description of the feature extraction module and the details


of the classifiers used. An introduction to hierarchical classification is provided along with the

performance measures used in assessment.

3.1 Database & Taxonomy

In order to solve the bird species identification problem in a machine learning framework, it was necessary to have a database of recorded bird songs labelled with their corresponding species. The site http://www.xeno-canto.org/ [5] contains an extensive collection of recorded songs of bird species, together with the scientific taxonomy of each species. A database of bird recordings therefore had to be obtained from the site by an information extraction procedure.

The taxonomical details of the employed dataset are provided in Table 1. The dataset is composed of 3435 recordings of bird songs from 48 species that occur on the South Atlantic coast of Brazil [4]. The recordings are not standardized: they were made in different environments and are corrupted by sounds from co-habiting species and other background noise sources such as wind, rain and vehicles.

For flat classification, only the species level was taken into account, giving altogether 48 classes. For hierarchical classification, the order and family levels were also considered. As there is a one-to-one correspondence between genus and species in the taxonomy of the employed dataset, the genus level is redundant and was omitted. Hence a 3-level hierarchy was used for classification. Table 1 is visualized as a tree in Figure 1, where the nodes are the class labels.

3.2 Class Labelling

In order to perform classification, the audio samples had to be tagged with appropriate labels.

For flat classification, the class labels were given as integers in the range {1, …, 48}. For

hierarchical classification, class labels were given for every level of hierarchy. The leaf nodes

of the tree have 4 components in their class-labels. For example, the 1st leaf from the left has

the class-label 0.1.1.1 (0 indicates the root of the tree) and the last leaf has the class-label

0.4.1.3.
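For later use (visualization and local classifiers), such dotted labels make the class tree easy to reconstruct. The following is a minimal MATLAB sketch, assuming a hypothetical cell array leafLabels containing the 4-component leaf labels; it builds a parent-to-children map of the hierarchy.

% Build a parent -> children map from the dotted leaf labels (illustrative sketch).
children = containers.Map('KeyType', 'char', 'ValueType', 'any');
for i = 1:numel(leafLabels)                       % e.g. leafLabels{1} = '0.1.1.1'
    parts = strsplit(leafLabels{i}, '.');
    for d = 1:numel(parts) - 1
        parent = strjoin(parts(1:d), '.');        % '0', then '0.1', then '0.1.1'
        child  = strjoin(parts(1:d+1), '.');
        if ~isKey(children, parent)
            children(parent) = {};                % first time this parent is seen
        end
        if ~any(strcmp(children(parent), child))  % avoid duplicate child entries
            children(parent) = [children(parent), {child}];
        end
    end
end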

3.3 Feature Extraction

The main task of the project was to concentrate on the implementation of hierarchical classification. Hence the ubiquitous static MFCCs were used as features. This choice is justified by the fact that MFCCs are the most commonly used features in audio-related pattern recognition problems and have proven robust in characterizing the amplitude spectrum in a way that corresponds to how the human auditory system processes audio. The openly available VOICEBOX speech processing toolbox [6] was used for this purpose. The default parameters were used: frame length equal to the largest power of two below 0.03 times the sampling frequency (in samples), 50% overlap between frames, 12 cepstral coefficients, a Hamming window in the time domain and triangular filters in the mel domain. The details of the individual processing steps are out of the scope of this report.

The mean and standard deviation of the frame-wise MFCCs thus computed were used as features. The feature extraction stage therefore yielded a representation of each raw audio signal as a 24-dimensional feature vector (the first 12 components being the means of the MFCCs and the remaining 12 their standard deviations).
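To make the pipeline concrete, a minimal MATLAB sketch of this step is shown below. It assumes the VOICEBOX toolbox is on the path and uses a hypothetical file name; the melcepst defaults coincide with the parameters listed above.

% Minimal sketch of the feature extraction step (assumes VOICEBOX is on the MATLAB path;
% 'recording.wav' is a hypothetical file name).
[s, fs] = audioread('recording.wav');   % read one bird-song recording
s = mean(s, 2);                         % mix down to mono if the file is stereo
C = melcepst(s, fs);                    % frame-wise MFCCs, one row per frame, 12 coefficients (defaults)
feat = [mean(C, 1), std(C, 0, 1)];      % 1x24 feature vector: 12 means followed by 12 standard deviations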


Table 1. A schema of the employed hierarchy [4]

Figure 1. Class hierarchy visualization (tree of height 3; the nodes are the class labels)


3.4 Training & Test Data Preparation

Every 24-dimensional feature vector was given the appropriate class label in order to perform supervised classification. The resulting dataset was divided into training and test sets in the ratio 70:30, taking care that all 48 species were represented in both sets. Since simple classifiers were used, no cross-validation was performed.
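A minimal sketch of such a stratified 70:30 split is given below, using hypothetical variable names: X is the N-by-24 feature matrix and y the N-by-1 vector of species labels 1 to 48.

% Stratified 70:30 split that keeps every species in both sets (sketch, not the exact script used).
rng(0);                                        % fix the random seed for repeatability
trainIdx = false(size(y));
for c = 1:48
    idx = find(y == c);
    idx = idx(randperm(numel(idx)));           % shuffle the recordings of species c
    nTrain = min(numel(idx) - 1, max(1, round(0.7 * numel(idx))));
    trainIdx(idx(1:nTrain)) = true;            % roughly 70% of each species goes to training
end
Xtrain = X(trainIdx, :);   ytrain = y(trainIdx);
Xtest  = X(~trainIdx, :);  ytest  = y(~trainIdx);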

3.5 Classifiers Used

The main requirement was for simple multi-class classifiers that could easily be extended to hierarchical classification. Hierarchical classification of text is a well-studied problem in which Naïve Bayes is the most commonly used classifier, so it was the natural first choice. Owing to the popularity and proven effectiveness of k-Nearest Neighbours in audio classification tasks, it was also adopted. Plain neural networks were tried as well, but were abandoned because of the limited computing power available.
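For illustration, flat versions of both classifiers can be trained in a few lines. The sketch below assumes MATLAB's Statistics and Machine Learning Toolbox rather than the exact functions used in the project, and reuses the hypothetical Xtrain/ytrain/Xtest/ytest variables from the previous sketch.

% Flat 48-class classifiers over the 24-dimensional features (illustrative only).
nbModel  = fitcnb(Xtrain, ytrain);                      % Naive Bayes with Gaussian class-conditionals
knnModel = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5);  % k-NN with k = 5 (k was varied from 1 to 10)
yhatNB   = predict(nbModel,  Xtest);                    % predicted species labels on the test set
yhatKNN  = predict(knnModel, Xtest);
accNB    = mean(yhatNB  == ytest);                      % flat classification accuracy
accKNN   = mean(yhatKNN == ytest);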

3.6 Hierarchical Classification

Hierarchical classification is a type of structured classification problem in which the output of the classification algorithm is defined over a class taxonomy. The term structured classification is broader and denotes any classification problem with some structure (hierarchical or not) among the classes. "A class taxonomy is a tree-structured regular concept hierarchy defined over a partially ordered set (C, ≺), where C is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the 'IS-A' relationship" [7]. The relation has the following properties:

- The single greatest element "R" is the root of the tree.
- ∀𝑐𝑖, 𝑐𝑗 ∈ 𝐶, if 𝑐𝑖 ≺ 𝑐𝑗 then 𝑐𝑗 ⊀ 𝑐𝑖 (asymmetry).
- ∀𝑐𝑖 ∈ 𝐶, 𝑐𝑖 ⊀ 𝑐𝑖 (irreflexivity).
- ∀𝑐𝑖, 𝑐𝑗, 𝑐𝑘 ∈ 𝐶, if 𝑐𝑖 ≺ 𝑐𝑗 and 𝑐𝑗 ≺ 𝑐𝑘 then 𝑐𝑖 ≺ 𝑐𝑘 (transitivity).

A class taxonomy can be a tree or a Directed Acyclic Graph (DAG). This report does not consider DAGs, as the methods used for them are quite different.

Figure 2. An example of a tree-based hierarchical class structure

Conventional two-class and multi-class classification methods cannot directly cope with hierarchical classes [7]. In the context of hierarchical classification, most approaches can be regarded as multi-label as well. For instance, considering the hierarchical class structure presented in Figure 2 (where R denotes the root node), if the output of a classifier is



class 2.2.1, it is natural to say that the example also belongs to classes 2 and 2.2, so that three classes are effectively output by the classifier.

Hierarchical classification methods differ in a number of criteria. The first criterion is the type of hierarchical structure used: tree or DAG. As previously mentioned, DAGs will not be considered here. The second criterion is how deep in the hierarchy the classification is performed: the method can be implemented so that it always predicts a leaf node (often termed mandatory leaf-node prediction in the literature), or it can stop the classification at a node in any level of the hierarchy (non-mandatory leaf-node prediction).

The third criterion is related to how the hierarchical structure is explored. The current

literature often refers to top-down (or local) classifiers, when the system employs a set of local

classifiers; big-bang (or global) classifiers, when a single classifier coping with the entire class

hierarchy is used; or flat classifiers, which ignore the class relationships, typically predicting

only the leaf nodes.

In this work, only flat classifiers and one top-down (local) method, the Local Classifier per Parent Node (LCPN) approach, were used. For more information on the definition, scope and details of hierarchical classification, the reader is referred to [7].

3.6.1 Flat Classification

The simplest approach to hierarchical classification is to ignore the class hierarchy completely and predict only classes at the leaf nodes [7]. During training and testing this approach behaves like a traditional classification algorithm. Indirectly, however, it provides a solution for hierarchical classification, because when a leaf class is assigned to an example, all its ancestor classes can be considered implicitly assigned to that instance. The serious disadvantage of this very simple approach is that a single classifier has to discriminate among a large number of classes (all leaf classes) without exploiting the parent-child relationships present in the class hierarchy [7]. Figure 3 illustrates this approach.

Figure 3. Flat classification using a flat multi-class classification algorithm. Circles represent

classes and shaded circles represent flat classes over which a single multi-class classifier is

trained.
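As a small illustration of how a flat leaf prediction implies its ancestors, the MATLAB sketch below expands a hypothetical leaf label such as '0.2.1.3' into the set of implicitly assigned classes.

% Expand a flat leaf prediction into its ancestor classes (the root '0' is omitted).
leaf  = '0.2.1.3';                        % hypothetical predicted leaf label
parts = strsplit(leaf, '.');
path  = cell(1, numel(parts) - 1);
for d = 2:numel(parts)
    path{d-1} = strjoin(parts(1:d), '.'); % label prefix down to depth d
end
disp(path)                                % {'0.2', '0.2.1', '0.2.1.3'}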



3.6.2 Local Classifier per Parent Node (LCPN) Approach

For every parent node in the class hierarchy, a multi-class classifier is trained to distinguish between its child nodes. Figure 4 illustrates this approach.

3.6.2.1 Training Phase

To train the classifiers, either the "siblings" policy or the "exclusive siblings" policy can be employed. The notation used to concisely describe the policies is listed in Table 2.

Figure 4. Local Classifier per Parent Node. Circles represent classes and partially shaded

circles represent multi-class classifiers predicting their child classes.

Table 2. Notations for local classifiers [7]

Symbol Meaning

𝑇𝑟 Set of all training examples

𝑇𝑟+(𝑐𝑗) Set of positive training examples of 𝑐𝑗

𝑇𝑟−(𝑐𝑗) Set of negative training examples of 𝑐𝑗

↑ (𝑐𝑗) Parent category of 𝑐𝑗

↓ (𝑐𝑗) Set of children categories of 𝑐𝑗

⇑ (𝑐𝑗) Set of ancestor categories of 𝑐𝑗

⇓ (𝑐𝑗) Set of descendant categories of 𝑐𝑗

↔ (𝑐𝑗) Set of sibling categories of 𝑐𝑗

∗ (𝑐𝑗) Examples whose most specific known class is 𝑐𝑗

Siblings policy: 𝑇𝑟+(𝑐𝑗) = ∗(𝑐𝑗) ∪ ∗(⇓(𝑐𝑗)) and 𝑇𝑟−(𝑐𝑗) = ∗(↔(𝑐𝑗)) ∪ ∗(⇓(↔(𝑐𝑗)))

Exclusive siblings policy: 𝑇𝑟+(𝑐𝑗) = ∗(𝑐𝑗) and 𝑇𝑟−(𝑐𝑗) = ∗(↔(𝑐𝑗))
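As an illustration, the siblings policy can be implemented by simple prefix matching on the dotted class labels. The sketch below uses hypothetical names: labels is a cell array with the full label of each training example, cj the current class label and siblings a cell array with the labels of its siblings.

% Select positive and negative training examples for class cj under the siblings policy (sketch).
% An example is positive if its label is cj itself or any descendant of cj,
% and negative if its label is a sibling of cj or a descendant of a sibling.
isUnder = @(labels, c) strcmp(labels, c) | strncmp(labels, [c '.'], numel(c) + 1);
posIdx  = find(isUnder(labels, cj));
negMask = false(size(labels));
for s = 1:numel(siblings)
    negMask = negMask | isUnder(labels, siblings{s});
end
negIdx = find(negMask);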

3.6.2.2 Testing Phase

The testing phase is best explained with an example. Considering Figure 4, suppose that the first-level classifier assigns an example to class 2. The second-level classifier, which was trained only on the children of node 2, in this case 2.1 and 2.2, then makes its class assignment (and so on, if deeper-level classifiers are available).
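A minimal MATLAB sketch of this top-down routing is given below. It assumes, hypothetically, that the per-parent classifiers are stored in a containers.Map keyed by the parent-node label, each trained (e.g. with fitcknn) to output one of its child-node labels as a string.

% Route one 24-dimensional feature vector x down the class tree (sketch).
function path = lcpnPredict(models, x)
    node = '0';                              % start at the root of the hierarchy
    path = {};
    while isKey(models, node)                % descend while the current node has a local classifier
        child = predict(models(node), x);    % the local classifier picks one child of 'node'
        node  = char(child);                 % e.g. '0.2', then '0.2.1', then a leaf label
        path{end+1} = node;                  %#ok<AGROW> record the predicted path from root to leaf
    end
end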



3.6.3 Performance Measures

When dealing with hierarchical classification problems, it is necessary to use evaluation

measures that are appropriate for these types of problems. In this work we have used the metrics

of hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-score (hF), as defined by Kiritchenko et al. in [8]. The formulas for computing these measures are as follows:

hP = \frac{1}{|I|} \sum_{i \in I} \frac{|\hat{P}_i \cap \hat{T}_i|}{|\hat{P}_i|}

hR = \frac{1}{|I|} \sum_{i \in I} \frac{|\hat{P}_i \cap \hat{T}_i|}{|\hat{T}_i|}

hF = \frac{2 \cdot hP \cdot hR}{hP + hR}

where I is the set of all test examples, P̂_i is the set consisting of the most specific class predicted for test example i and all its ancestor classes, and T̂_i is the set consisting of the true class of test example i and all its ancestor classes. The motivation for these measures is discussed in detail in [7] and [8].
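With mandatory leaf-node prediction, these measures reduce to simple set operations on the root-to-leaf paths. A minimal sketch is shown below; predPaths and truePaths are hypothetical cell arrays whose i-th element lists the predicted or true class labels of test example i (root excluded), as produced for instance by the lcpnPredict sketch above.

% Hierarchical precision, recall and F-score averaged over the test set (sketch).
nTest = numel(truePaths);
hP = 0;  hR = 0;
for i = 1:nTest
    common = numel(intersect(predPaths{i}, truePaths{i}));  % size of the overlap of predicted and true paths
    hP = hP + common / numel(predPaths{i});                 % per-example hierarchical precision
    hR = hR + common / numel(truePaths{i});                 % per-example hierarchical recall
end
hP = hP / nTest;
hR = hR / nTest;
hF = 2 * hP * hR / (hP + hR);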

4. Project Realization

This section covers the non-technical aspects of the project: the total labour hours spent, the division of the workload among the implementation steps, planned versus realized tasks, problems faced and lessons learnt.

4.1 Workload Division

The project commenced in the third week of January 2015 with introductory seminars. The actual work started on 1 February 2015, after the project topic had been finalized. The overall implementation of the project and the division of labour hours are visualized in Figure 5.

4.2 Meetings with the client

There were regular e-mail exchanges and bi-weekly personal meetings with the client.

4.3 Problems, delays and changes

The planned schedule was largely realized, although there were several unforeseen delays. Data retrieval was not easy: there was initial confusion over the choice between the Macaulay Library database (http://www.birds.cornell.edu/) and Xeno-Canto [5], with the latter finally being chosen. Data from Xeno-Canto was not easily downloadable, and some basic web programming, to which the author had never been exposed, had to be done. The feature extraction step, on the other hand, was quite easy with the readily available toolbox and took less time than anticipated. Delays occurred when the author encountered poor accuracies in flat classification with several classifiers; under the mistaken assumption that the flat classification accuracy had to be improved first, the author spent more time on that task than planned. Moreover, the exploration of neural network training, and multiple machine crashes caused by the large amount of training data,


further delayed the progress. Around the same time, the author also had health issues due to Seasonal Affective Disorder. The project regained the lost momentum with the implementation of hierarchical classification.

4.4 Budget

With the 190 labour hours spent on the tasks and a labour cost of 14 €/h, the total labour cost comes to 2660 euros.

Figure 5. Project implementation chart: division of the 190 labour hours.

Task                                                                                     Share
Familiarization with taxonomy, data retrieval and organization (for flat classification)  16%
Literature review                                                                          10%
Writing project plan report                                                                 5%
Feature extraction & preparation of training & test sets                                   11%
Flat classification, exploration of suitable classifiers & choice of features              11%
Development of visualization tool for class hierarchy                                      13%
Implementation of hierarchical classification & evaluation                                 13%
Writing final report                                                                       13%
Attending seminars & presentations                                                          8%


4.5 Lessons learnt

With a heavy workload from other courses and unforeseen health issues, the author learnt valuable lessons in time management. He also learnt the importance of clarifying any technical difficulty by talking openly to the client/guide instead of pondering over it alone.

5. Project Results and Conclusions

The first important result of the project was the visualization of the class hierarchy of the bird sound database, already shown in Figure 1.

Figure 6 shows the main results of hierarchical classification. The flat and hierarchical (LCPN) classification schemes were compared in terms of the hierarchical precision, recall and F-score measures. The number of neighbours k in k-Nearest Neighbours was varied between 1 and 10, using the same value for all local classifiers; in other words, no two local classifiers, irrespective of their position in the class hierarchy, were allowed different k values. This restriction simply avoids an unnecessarily exhaustive analysis. The performance of the hierarchical (LCPN) Naïve Bayes classifier is shown in the same figure for easy comparison.

Figure 6. Hierarchical precision (hP), recall (hR) and F-score (hF) as functions of the number of neighbours k (1-10) for flat k-NN, LCPN k-NN and LCPN Naïve Bayes.

It can be observed from Figure 6 that hierarchical classification with LCPN has an edge over the flat-classification approach, albeit not a significant one. The reasons are manifold. One is the presence of background noise in the audio data, which may call for pre-processing that identifies the bird-song segments in the raw recordings. Another is the unbalanced class hierarchy: most of the bird species belong to the order Passeriformes, so the class hierarchy carries little additional information down to the leaf nodes. These are open questions that demand more thorough investigation. Other hierarchical classification methods, especially big-bang (global) classifiers [7], could also be explored.



6. Comments on the Course

The course provided the author with hands-on experience by giving him an opportunity to apply the methods he had learned and to do real programming. It also gave him a chance to hone his project management skills at all levels. The mandatory seminars, especially the one on report writing, were highly useful and helped refine the author's writing skills. Overall, the course was extremely satisfying, and the author thanks his guide/client Tuomas Virtanen for giving him this opportunity and the coordinator Sari Peltonen for the smooth conduct of the course.

7. References

[1] Z. Chen and R. C. Maher, "Semi-automatic classification of bird vocalizations using spectral peak tracks," Journal of the Acoustical Society of America, vol. 120, pp. 2974-2982, 2006.

[2] F. Briggs, R. Raich and X. Z. Fern, "Audio classification of bird species: a statistical manifold approach," in Proc. of the 9th IEEE International Conference on Data Mining, 2009, pp. 51-60.

[3] N. Cesa-Bianchi, C. Gentile and L. Zaniboni, "Hierarchical classification: combining Bayes with SVM," in Proc. of the 23rd International Conference on Machine Learning, 2006.

[4] C. N. Silla Jr. and C. A. A. Kaestner, "Hierarchical classification of bird species using their audio recorded songs," in Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 1895-1900.

[5] Xeno-Canto, Sharing bird sounds from around the world [online]. Available: http://www.xeno-canto.org/

[6] M. Brookes, VOICEBOX: Speech Processing Toolbox for MATLAB, Department of Electrical & Electronic Engineering, Imperial College London [online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[7] C. N. Silla Jr. and A. A. Freitas, "A survey of hierarchical classification across different application domains," Data Mining and Knowledge Discovery, vol. 22, pp. 31-72, 2011.

[8] S. Kiritchenko, S. Matwin and A. F. Famili, "Functional annotation of genes using hierarchical text categorization," in Proc. of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005.