Harbour Porpoise Click Train Classification with …kth.diva-portal.org/smash/get/diva2:1146181/FULLTEXT01.pdfDEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM,

IN DEGREE PROJECT ELECTRICAL ENGINEERING,SECOND CYCLE, 30 CREDITS

, STOCKHOLM SWEDEN 2017

Harbour Porpoise Click Train Classification with LSTM Recurrent Neural Networks

FILIP ÄRLEMALM

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ELECTRICAL ENGINEERING

Harbour Porpoise Click Train Classification withLSTM Recurrent Neural Networks

Filip Karl Oskar Ärlemalm

September 19, 2017

Abstract

The harbour porpoise is a toothed whale whose presence is threatened in Scandinavia. Onestep towards preserving the species in critical areas is to study and observe the harbourporpoise population growth or decline in these areas. Today this is done by using underwateraudio recorders, so called hydrophones, and manual analyzing tools. This report describes amethod that modernizes the process of harbour porpoise detection with machine learning. Thedetection method is based on data collected by the hydrophone AQUAclick 100. The data isprocessed and classified automatically with a stacked long short-term memory recurrent neuralnetwork designed specifically for this purpose.

i

ii

Sammanfattning

Vanlig tumlare är en tandval vars närvaro i Skandinavien är hotad. Ett steg mot att kunnabevara arten i utsatta områden är att studera och observera tumlarbeståndets tillväxt ellertillbakagång i dessa områden. Detta görs idag med hjälp av ljudinspelare för undervattens-bruk, så kallade hydrofoner, samt manuella analysverktyg. Den här rapporten beskriver enmetod som moderniserar processen för detektering av vanlig tumlare genom maskininlärning.Detekteringen är baserad på insamlad data från hydrofonen AQUAclick 100. Bearbetning ochklassificering av data har automatiserats genom att använda ett staplat återkopplande neuraltnätverk med långt korttidsminne utarbetat specifikt för detta ändamål.

iii

iv

Table of Contents

1 Introduction 01.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0

1.1.1 AQUAclick 100 and AQUAClickView . . . . . . . . . . . . . . . . . . . . . . 01.1.2 Limits of AQUAclick 100 and AQUAClickView . . . . . . . . . . . . . . . . 01.1.3 Automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0

1.2 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Ethical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Harbour porpoises 22.1 General facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Click characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.2.1 Harbour porpoise clicks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2.2 Click duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2.3 Frequency spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2.4 Amplitude ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Click train characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3.1 Inter-click interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3.2 Deviation from the median . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3.3 Inter-click difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3.4 Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Machine learning 83.1 Supervised and unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 83.2 Bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.3 Training, validation and test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.4 Overfitting and underfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.5 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.6 Truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.7 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 LSTM recurrent neural networks 114.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.2 Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.3 The memory cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.4 Weights and biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.5 Forget, input and output gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.5.1 The forget gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.5.2 The input gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134.5.3 The output gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.6 Training the network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.6.1 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.6.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.6.3 Gradient stochastic descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.6.4 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.6.5 Batch and epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.7 Stacked networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Click train extraction and processing 175.1 Extracting click trains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.2.1 Reshape for input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175.2.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175.2.3 Truncation and normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.3 Reverb detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

v

6 Click train classification 196.1 Manual classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.2 Automated classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.2.2 LSTM network structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.2.3 Network layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206.2.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

7 Implementation setup 227.1 AQUAClickView . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.2 MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3 Python and Python libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7.3.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3.2 Keras and Theano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3.3 Numpy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3.4 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3.5 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227.3.6 Tkinter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

8 Implementation 248.1 MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

8.1.1 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248.1.2 Manual classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

8.2 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248.2.2 Graphical user interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248.2.3 Classification levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258.2.4 Click train data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268.2.5 Click train plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268.2.6 Click train sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288.2.7 Adaptive click duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298.2.8 Adaptive frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298.2.9 Adaptive cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298.2.10 Insert real porpoise click . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

9 Evaluation 309.1 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

9.1.1 Precision and recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309.1.2 The Fβ score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309.1.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309.1.4 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

9.2 The classification threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319.2.1 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319.2.2 Precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319.2.3 Fβ optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

10 Results 3310.1 Final network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3310.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3310.3 Optimizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3410.4 Confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

11 Discussion 3811.1 General results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3811.2 Dataset improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3811.3 Network improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

vi

1 Introduction

1.1 Background and motivation

1.1.1 AQUAclick 100 and AQUAClickView

Kolmården Wildlife Park has extensively collected data from harbour porpoises in the Baltic sea,Skagerrak and Kattegat. Some of this data has been collected with a hydrophone specificallydesigned for recording sounds from harbour porpoises, and to some extent dolphins, called theAQUAclick 100. The AQUAclick 100 and its accompanying software AQUAClickView has beendeveloped and is owned by Aquatec Group Ltd. The AQUAclick 100 does not store high frequencysound files but instead filters the data already at the recording stage. This process is called loggingand only data that is considered relevant is saved at the recording stage. This logged data collectedby the AQUAclick 100 can then be studied and viewed in AQUAClickView. This software allowsthe user to browse through the logged data, zoom in at specific parts and to display differentfeatures of the logged data. Each individual data point is called a click and each click consists offour different features that are discussed in detail in Section 2.2 and 5.2.2.AQUAClickView has an inbuilt harbour porpoise click classifier based on threshold parameters.These threshold parameters are selected manually by an analyzer who studies the clicks and se-lects the threshold parameters that best fit the logged data at hand. Selecting all-encompassingthreshold parameters can be difficult and it is often impossible to set up a satisfactory harbourporpoise classifier by using this method.

1.1.2 Limits of AQUAclick 100 and AQUAClickView

Harbour porpoises produce consecutive and time-connected clicks called click trains. An analyzerdistinguishes the clicks originating from a harbour porpoise from non-porpoise clicks by observinghow these time-connected clicks relate to each other. This aspect of classification is mostly lostin the inbuilt AQUAClickView classifier since each click is classified independently of each other.The AQUAClickView software does have additional click train classification abilities but these arelimited since they are dependent on initially classifying individual clicks. A better approach is toclassify time-connected clicks in click trains directly.There are also limitations to the AQUAclick 100 recordings. Harbour porpoise clicks can be missedout due to a weak signal and the recordings may contain reverbs that look different from the rest ofthe clicks. All of this makes using a threshold based classification approach even more impractical.An analyzer can detect these fallacies by observing clicks in click trains and make classificationsbased on pattern recognition. The analyzer also needs to make up for the frequently incorrectclassifications and manually study the data in order to correct them. This can often be a timeconsuming and tedious process.

1.1.3 Automation

The main goal of this thesis project is to create a classifier that is smarter and more reliable than theinbuilt AQUAClickView classifier. Another goal is to remove the need for an analyzer at least partly.By using machine learning not only can this be achieved, but it also makes possible the classificationof large datasets accurately and quickly. By automating the classification process the classificationtask becomes less time consuming and the classifications overall become more consistent. AnLSTM (long short-term memory) recurrent neural network is a machine learning algorithm that isdesigned for classification and prediction of time series and sequences. The harbour porpoise clicktrains are time series and are well-suited for classification with this method. It should be noted thatthe classifier developed in this thesis project should serve as a complement to the AQUAClickViewsoftware due to the viewing capabilities of the logged data that it provides. This thesis projectwas carried out at KTH Royal Institute of Technology and supported by Kolmården Wildlife Parkas a part of a master’s thesis work.

0

1.2 Previous workPorpoise click classification based on click trains has previously been done in the T-POD andC-POD hydrophones and their accompanying software. The C-POD is an improved successorof the T-POD produced by the same company, OSC Ltd. The click train classifier and detectorimplemented in the C-POD is at present the industry standard for porpoise and cetacean detection.The C-POD classifier is essentially a glorified threshold based classifier and does not currentlyemploy any machine learning techniques. The C-POD classifier is described by the producingcompany to be a black-box [1] and is closed source. There are however clear advantages to usingthe C-POD hydrophones as opposed to the AQUAclick 100. Not only is the inbuilt click trainclassifier and detector more accurate, the recordings by the C-POD hydrophone also contain moreinformation. The C-POD records high frequency data and also stores more useful data for eachindividual click.The reason for why the AQUAclick 100 is still used in this thesis project is that the data is easierto access and process. The AQUAclick 100 produces data that is good enough for classification toan analyzer in most cases. The AQUAclick 100 is also more affordable than the C-POD.

1.3 Ethical considerationsThe harbour porpoise is not considered to be a threatened species globally. In the Baltic seahowever the porpoise population has decreased drastically the last decades and is considered to beseriously threatened. Thousands of harbour porpoises in this region get stuck in thin fishing netsmade out of nylon every year [2]. To protect the porpoises the fishermen are advised to use devicesthat warn the porpoises of the presence of nets but not all fishermen follow this advice. Beingable to detect harbour porpoises could help with preserving the species and to observe changes inthe population size. It is also possible that a greater ability to detect harbour porpoises may leadto an increased amount of human interventions in their natural habitats. This could potentiallyfurther threaten the status of harbour porpoises in critical areas.

1

2 Harbour porpoises

2.1 General factsThe harbour porpoise is a type of toothed whale that resides in the North Atlantic, the North Pacificand the Black Sea. In Scandinavia its occurrence is most common in Skagerrak and Kattegat [3].The harbour porpoise produces click like sounds for echo navigation to localize itself and its pray.It is by detecting these click sounds that it is possible to identify harbour porpoises by passiveacoustic monitoring.

2.2 Click characteristics

2.2.1 Harbour porpoise clicks

The harbour porpoise produces short click like sounds for echo navigation. In this thesis projectthese clicks are recorded by the hydrophone AQUAclick 100. From the recordings only the keyfeatures of the clicks are logged to save space and to filter out silence and noise. The filtering is doneby passing the sound signal through a half-wave rectifier that functions as an envelope detectorfor each click as shown in Fig. 1. A click is considered to start and end while the amplitude ofthe enveloped click signal has reached a given threshold of 129 mV [4]. The frequency content ofthe enveloped click signals are calculated and logged, as well as the time between separate clicks,called the inter-click interval.

Figure 1: Illustration of the click detector implemented in the AQUAclick 100 with unspecifiedclick duration and amplitude. The green vertical lines represent the start and end of the clickduration. The horizontal red line represents the amplitude threshold for click detection. The bluecurve is the output of the envelope detector.

2.2.2 Click duration

The click duration is estimated by measuring the time for which the enveloped click signal is abovethe threshold of 129 mV. This implies that the actual click duration is likely to be longer if theenveloped click signal is weak. The click duration of a harbour porpoise click can last around1 − 800 µs but most commonly lasts around 150 − 300 µs as can be seen in Fig. 2. The clickduration can vary from one click to the other with an average standard deviation σ ≈ 93 µs withina given click train. The click durations for a typical click train are shown in Fig. 3.

2

Figure 2: Histogram of click durations from manually classified porpoise click trains with 50 binsand reverbs removed as described in Section 5.3 and 6.1.

Figure 3: Click durations for a typical click train. The variance of the click durations is low withinthe click train. Recorded in Middelfart, May 2017.

2.2.3 Frequency spectrum

The frequency spectrum of a harbour porpoise click is narrow-banded centered at 130 kHz as shownin Fig. 4. In contrast to other toothed whales the frequency does not vary considerably betweenclicks within a given click train. The click sound frequency is therefore the most characteristicfeature of harbour porpoise clicks. Despite this, filtering out clicks at 130 kHz is not on its ownenough to successfully identify harbour porpoise clicks due to high noise levels in the ocean. It isalso difficult to identify harbour porpoise clicks individually based on only the frequency content.It is therefore better to classify porpoise clicks based on entire click trains.

3

Figure 4: Power spectral density estimate of a harbour porpoise click recorded at Fjord & Bælt.The power spectral density was estimated using Welch’s method with a Hamming window of length128, 50 % overlap and a 1028-point DFT.

2.2.4 Amplitude ratio

To separate harbour porpoise clicks from noise, the AQUAclick 100 employs two bandpass filterscentered at 130 kHz and 60 kHz respectively. The harbour porpoise does not naturally produceclicks at 60 kHz as can be seen in Fig. 4. The amplitude at 60 kHz for each click is hence collectedto filter out noise. The amplitude outputs of each bandpass filter are compared by calculating theratio between the two amplitudes. This ratio called the amplitude ratio is a measurement usedto distinguish between harbour porpoise clicks and other click-like noise. Typically the amplitudeoutput from the 130 kHz filter is three times larger than the amplitude output from the 60 kHzfilter for a harbour porpoise click. This simply means that the typical amplitude ratio value is3. The amplitude ratio can vary within a harbour porpoise click train, but the amplitude ratiousually does not reach a value lower than 1 unless the signal-to-noise ratio is low. The amplitudeoutputs of the 130 kHz and 60 kHz band-pass filters for a typical click train are shown in Fig. 5and 6.

Figure 5: Amplitude outputs of the 130 kHz filter in the AQUAclick 100 for a typical porpoiseclick train. Recorded in Middelfart, May 2017.

Figure 6: Amplitude outputs of the 60kHz filter in the AQUAclick 100 for a typical porpoise clicktrain. The amplitude of the 60 kHz filter is low compared to Fig. 5. Recorded in Middelfart, May2017.

4

2.3 Click train characteristics

2.3.1 Inter-click interval

The measured time between separate clicks in a harbour porpoise click train is called the inter-clickinterval. This measurement vary across different click trains with a typical value around 1 − 149ms as shown in Fig. 7. Within a single click train however the variance is low and the inter-clickintervals are therefore the most important feature when classifying harbour porpoise click trains.The average standard deviation within a given click train is σ ≈ 16 ms. The inter-click intervalsfor a typical click train are shown in Fig. 8.

Figure 7: Histogram of inter-click intervals from manually classified porpoise click trains with 50bins and reverbs removed. Note that any value larger than 149 ms has been truncated to equal149 ms as described in Section 5.2.3 and 5.3.

Figure 8: Inter-click intervals for a typical porpoise click train. The variance of the inter-clickintervals within the click train is low. Recorded in Middelfart, May 2017.

5

2.3.2 Deviation from the median

One way to distinguish between harbour porpoise click trains and non-porpoise click trains is thelow variance of the inter-click intervals within a given harbour porpoise click train. Each inter-clickinterval within a click train can therefore be checked for how much it deviates from the medianvalue of the whole click train as shown in Fig. 9a and Fig. 9b. The reason as to why the medianvalue is used instead of the mean value is that it is better when trying to classify short click trains.A short click train that is still classifiable can contain as few as four clicks. One thing to note isthat the inter-click interval for the first click in the click train can take on any value larger than 149ms as described in Section 5.2.3. The first click is for that reason not included in the calculationof the median value.The click durations usually also have a low variance within a given harbour porpoise click train,albeit not to the same extent as for the inter-click interval. Compare Fig. 10a and Fig. 10b wherethe side lobes can be observed to be smaller in Fig. 10a. The deviation from the median for theclick duration can still give a hint of the correct classification and it is calculated in the same wayas for the inter-click interval.

(a) Porpoise (b) Non-porpoise

Figure 9: The deviation from the median for harbour porpoise and non-porpoise inter-click inter-vals.

(a) Porpoise (b) Non-porpoise

Figure 10: The deviation from the median for harbour porpoise and non-porpoise click durations.

6

2.3.3 Inter-click difference

It is also important to observe how the inter-click interval and the click duration changes fromone click to the next within a given click train. One common trait of the AQUAclick 100 datais that clicks can be missed out within click trains due to a weak signal or a high signal-to-noiseratio. This can be observed for the inter-click interval of one click to the next within a given clicktrain. The inter-click interval will then be approximately twice as large when one click has beenmissed and thrice as large if two consequent clicks have been missed out in the recording. As fornon-porpoise click trains the inter-click difference tend to be more or less random for both theinter-click interval and the click duration. This is not the case for harbour porpoise click trainsdue to the low variance of both these features within a given click train.

2.3.4 Time

The spacing of the clicks within a given click train is only partially expressed by the inter-clickinterval and the inter-click difference of the inter-click interval. The time refers to when a clickoccurred relative to the first click in the click train. This timing of occurrence generates informationabout how all features of a click train have changed over time. It also provides insight into how longthe click train has lasted. Most harbour porpoise click trains will last for around 50 millisecondsto a few seconds. For non-porpoise click trains this figure tends to be more random.

7

3 Machine learning

3.1 Supervised and unsupervised learningSupervised and unsupervised learning are two main categories of machine learning. For supervisedlearning a dataset of corresponding pairs of inputs and desired outputs of the machine learningalgorithm is utilized. The designer of a supervised machine learning algorithm uses the datasetto train the algorithm to perform in a desired fashion. The performance of the machine learningalgorithm is evaluated and optimized by the designer through adjusting controllable parametersof the algorithm. The performance optimization is carried out during training on a subset ofthe dataset, called the validation data, as described in Section 3.3. The final step of supervisedlearning is to test the performance of the algorithm on data independent of both the trainingand the validation data. Supervised learning hence requires a designer who carefully designs themachine learning algorithm and controls the performance.Unsupervised learning on the other hand does not require a designer. An unsupervised machinelearning algorithm instead extracts information directly from the input data and makes its ownpredictions for classification and/or categorization. This is usually done by clustering similar datatogether. The advantage of using a supervised learning approach is that it gives the designerthe control to optimize the performance of a machine learning algorithm for a specific task. Asupervised machine learning algorithm can thus be constructed to have a function and performanceas close to ideal as possible.

3.2 Bias-variance trade-offThe bias and variance are two sources of error in a supervised machine learning algorithm [5]. Ifa machine learning algorithm is biased it has made incorrect assumptions from the training data.The algorithm then consistently makes incorrect outputs from given inputs of test data.A machine learning algorithm with high variance on the other hand is sensitive to noise or thevariance within the training data. This causes the algorithm to make incorrect outputs due tosmall variations in the test data. Both of these sources of error are undesirable but making thealgorithm less biased causes the variance to increase and vice versa. There is therefore a trade-offthat has to be made between the bias and the variance when training, evaluating and optimizinga supervised machine learning algorithm.

3.3 Training, validation and test dataIn the case of supervised machine learning, a dataset of corresponding pairs of inputs and desiredoutputs is initially collected. This dataset is divided into three different groups of data - trainingdata, validation data and test data. The training data is used to make the machine learningalgorithm function in a desired fashion by training the algorithm. The validation data is usedduring the training process to evaluate the performance of the algorithm. It is important to notethat the validation data is independent of the training data. The independence of the validationdata allows for providing insight as to how the algorithm would perform in a "real-world" situation.During training the designer of a supervised machine learning algorithm optimizes the performanceand function of an algorithm by evaluating its performance for both the training data and thevalidation data as explained in Section 3.4. Once the machine learning algorithm is trained andoptimized for the training and validation data, it needs to be tested in an actual "real-world"situation. This is when the test data is used. The test data is independent of both the trainingdata and the validation data. The test data is used for a final evaluation of the performance of thealgorithm. If the performance on the test data is unsatisfactory, the training process needs to beredone. To clarify the difference, validation data is used to test the performance during trainingand test data is used to test the performance of the algorithm after the training is finished. Boththe validation data and the test data is independent from the training data.A common way to split the dataset is 60-20-20 [6]. This means that 60 percent of the total datasetis assigned to training data, 20 percent to validation data and the remaining 20 percent to testdata. The 60-20-20 split is used in this thesis project.

8

3.4 Overfitting and underfittingA major obstacle to overcome in a machine learning task is overfitting. Overfitting occurs when amachine learning algorithm becomes too well adapted to the training data during training and isno longer able to function reliably for data outside of the training data. Overfitting is essentiallythe same as high variance. High variance is described in more detail in Section 3.2. A popular wayto visualize overfitting is what occurs when a high order polynomial is used to fit a small numberof data points as shown in Fig. 11 [7]. This should be in contrast to using a simpler model to fitthe data points to. Overfitting is in this way caused by overgeneralizing.

Figure 11: Two polynomials fit to data points generated by a straight line with added Gaussiannoise. The blue polynomial has generalized well while the red polynomial is overfit.

Overfitting is undesirable since it cripples the general performance of the machine learning algo-rithm. This is why the performance needs to be evaluated for data independent of the training data(validation data) during training. Even in the case that validation data is used during training,overfitting can occur if the validation data is too similar to the training data. Test data musttherefore be used to ensure good performance outside of the training data and the validation dataafter training. Another important way to avoid overfitting is to ensure that the dataset capturesall aspects of the desired performance in all possible situations. Both the quantity and quality ofthe data is thus important. Incorrectness in the dataset causes overfitting and lack of data makesthe algorithm overgeneralize for the few available data points.The opposite of overfitting is underfitting. This occurs when a machine learning algorithm does notperform well even for training data. Underfitting can be solved by choosing a better and possiblymore complex model that better fits the training data. The problem of overfitting is usually morecommon and more difficult to solve for than underfitting.

3.5 FeaturesFeatures are what is used to capture characteristics in the dataset. The features are also what themachine learning algorithm learns and extracts information from during training. Features can beresembled to cognitive hints. For example can color, shape and size be a way to distinguish betweenan apple, a pear or any other fruit. In this simple example the features are simply color, shape andsize. Choosing the right features that best distinguish the differences within a dataset is one of themost important tasks for a designer of a supervised machine learning algorithm. Choosing too fewfeatures can result in underfitting. Choosing too many on the other hand can result in overfittingand/or slow and inefficient performance. The quality of the features has a high correlation withthe overall performance of the algorithm in terms of both accuracy and speed. In the exampleof distinguishing between an apple and a pear, only using the feature of color would cause thealgorithm to confuse green apples and pears. The algorithm becomes underfit due to the lack of

9

distinguishing features. Adding a low quality feature like the number of worms inside the fruit doesnot usually provide any additional information. This feature would in most cases be redundantand including this feature could potentially slow down the algorithm. But it is also possible toimagine a case where all apples during training had worms and the pears did not. The algorithmwould hence make the mostly false assumption that worms ⇐⇒ apple and no worms ⇐⇒ pear.The algorithm becomes overfit due to this redundant feature.

3.6 TruncationOverfitting can also be caused by outliers in the data. To use the same example of distinguishingapples and pears in Section 3.5, an unusually small apple could incorrectly be confused for a peardue to its unusual size. This might be because pear =⇒ small size for the most part in thisexample, but small size ⇐⇒ pear does not hold true. This can be solved by regulating thesize of all fruits in order to avoid for outliers to even occur. This regulation is called truncating.Truncation is an operation of limiting the data features to contain outliers and is often done bysetting thresholds. Truncating can also be used to set a limit for the information contained in athreshold. In the apple and pear example, this could be that an extremely green pear is not morelikely to be a pear compared to the case of a normally green pear. If the training data has lowquantity and quality this might also cause the normally green pears to be incorrectly classifieddue to their lack of greenness. The level of greenness only provides distinguishing information toa certain extent.

3.7 NormalizationIt can be beneficial to make the values of a dataset feature F have set boundaries, for example F ∈(0, 1) or F ∈ (−1, 1). This will make the input value boundaries of feature F of the dataset the sameas the output in some machine learning algorithms, for example LSTM recurrent neural networks.This causes faster convergence and processing time. This is because most neural networks use socalled activation functions, like the sigmoid or the tanh functions, that makes the outputs insidethe network continuous and differentiable as to enable backpropagation as described in Section4.6.2. This type of normalization is usually done by dividing all values in F by the maximum valuein F .High variance within the training data can also cause overfitting. This implies that the exactvalues of the features are not important but their relative relationships are. This can be solvedby normalizing the subset values in F by the mean of the same subset values in F or a similarapproach. The latter type of normalization is not used in this thesis project.

10

4 LSTM recurrent neural networks

4.1 MotivationLSTM stands for Long Short-Term Memory. An LSTM recurrent neural network is a supervisedmachine learning algorithm that is designed to learn from contexts in time series and rememberingpassed inputs. It was first proposed by Hochreiter and Schmidhuber in their paper on Long shortterm memory published in 1997 [8] and further extended by Schmidhuber et al. in 1999 [9]. AnLSTM recurrent neural network is ideal for the case of harbour porpoise click train classificationsince the click trains contain clicks that change over time. The time dependency and context of theclicks are crucial for distinguishing harbour porpoise click trains from non-porpoise click trains.

4.2 UnitsThe LSTM network is built up by a set of connected units [10]. Each unit receives both newinput and the output from the previous connected units as shown in Fig. 12. The unit processesthe information by passing it on to gates that regulate what information will be forgotten orremembered. Each unit consists of three separate gates: a forget gate, an input gate and anoutput gate. In addition to the gates, each unit contains tanh functions that regulate the outputsizes. The gates and tanh functions are represented as boxes in Fig. 12. The unit also receivesinformation from a memory cell that collects information from all previous units and passes it onto the next one. Each of the gates are sigmoid functions that outputs a value between 0 and 1.The outputs from the sigmoid functions decide how much of the information will be passed on tothe memory cell.

Figure 12: One unit of an LSTM recurrent neural network. The green line represents the flow ofthe memory cell throughout the unit, the blue line represents the input to the current unit andthe orange line represents the output of the previous unit. To the right in red is the final outputof the unit.

4.3 The memory cellThe memory cell contains information about previous inputs and passes it on to the next unit.In each unit new information may be added or removed from the memory cell. This allows forfinding longtime dependencies between data inputs over time. The memory cell gets changed andupdated for each pass between units. The memory cell is therefore unique for each unit but ischanged depending on the inputs in previous cells. The dimension of the memory cell is the sameas that of the output of the unit. The dimension of the memory cell can be different from that ofthe input. This is important since this allows the LSTM network to output sequences longer orshorter than the input.

11

4.4 Weights and biasesBefore an input is passed on to a gate it is always first multiplied by a weightWk and summed witha bias term bk. The weights and biases are shared from unit to unit. The memory cell and everygate within a unit has a unique set of weights and biases. The weights and biases are essential forthe function of the network and are what decides how the input data to the network is going to beprocessed. The weights and biases are both learned during the training phase of the network.

4.5 Forget, input and output gates

4.5.1 The forget gate

The input xt of the current unit and the output from the previous unit ht−1 are first sent to theforget gate fg as shown in Fig. 13. The forget gate determines what information in the previousmemory cell Ct−1 will be stored and passed on to the next unit, and what information will beforgotten. The output values from the forget gate are between 0 and 1.

fg =1

1 + e−(Wf [xt,ht−1]+bf )

In most cases both xt and ht−1 are vectors. Inside the forget gate these two vectors get con-catenated to [xt, ht−1]. This implies that if xt is a vector of size n× 1 and ht−1 is a vector of sizem × 1, the resulting concatenated vector is of size (n +m) × 1. The memory cell Ct−1 is also avector of size m × 1, which is the same size as for ht−1 as described in Section 4.3. To keep thedimensionality consistent, the weight matrix Wf is of size m× (n+m) and the bias is a vector ofsize m × 1. The resulting output of the forget gate is hence a vector of size m × 1. This vectoris then multiplied entry-wise with the previous memory cell Ct−1. The resulting vector Cstored isagain a vector of size m× 1.

Cstored = fg ◦ Ct−1

Since each entry in the output vector of the forget gate fg is between 0 and 1, an output of 0will not let any information stored in the corresponding entry in the memory cell Ct−1 through,while an output value of 1 will let all of it through.

Figure 13: The input xt and previous output ht−1 are passed into the forget gate fg. The outputof the forget gate is then multiplied entry-wise with the previous memory cell Ct−1. This resultsin the memory cell Cstored.

12

4.5.2 The input gate

Next the input of the current unit xt and the output from the previous unit ht−1 are passedthrough a tanh function that maps all values to be between -1 and 1 as shown in Fig. 14. Theoutput Cupdate of the tanh function can be viewed as a temporary memory cell that will be addedto the final memory cell.

Cupdate = tanh(Wc[xt, ht−1] + bc)

where the dimension of Cupdate is m × 1, the dimension of the weight matrix Wc is m × (n +m)and the dimension of the bias bc is m× 1.

Figure 14: The input xt and previous output ht−1 are passed through a tanh function. The outputof the tanh function Cupdate will later be scaled and added to the memory cell.

The input of the current unit xt and the output from the previous unit ht−1 are then sent to theinput gate as shown in Fig. 15. This gate decides how much of the input and the previous outputwill be added to the memory cell. The output values of the input gate ig are again values between0 and 1.

ig =1

1 + e−(Wi[xt,ht−1]+bi)

where the dimension of ig is m× 1, the dimension of the weight matrix Wi is m× (n+m) and thedimension of the bias bi is m× 1.

Similarly to the forget gate, the output values of the input gate are multiplied entry-wise with theoutput of the tanh function, or the temporary memory cell, to decide how much of it will be passedon to the memory cell as illustrated in Fig. 15. The final result is then added to the memory celland this updated memory cell Ct will then be passed on to the next unit.

Ct = Cstored + ig ◦ Cupdate

where the dimension of Ct is m× 1.

13

Figure 15: The input xt and previous output ht−1 are passed into the input gate. The output ofthe input gate ig is then multiplied element-wise with the temporary memory cell Cupdate. Thefinal result is then added to what is still stored from the previous memory cell Cstored to createthe current memory cell Ct.

4.5.3 The output gate

Finally the input of the current unit xt and the output from the previous unit ht−1 are passed intoan output gate as shown in Fig. 16. The output gate og basically has the exact same function asthe previous gates, it is a sigmoid function that outputs a vector containing values between 0 and1 that determines how much information will be let through.

og =1

1 + e−(Wo[xt,ht−1]+bo)

where the dimension of og is m × 1, the dimension of the weight matrix Wo is m × (n +m) andthe dimension of the bias bo is m× 1.

Next the output of the output gate is multiplied entry-wise by the current memory cell. Beforethis multiplication the current memory cell Ct is passed through another tanh function that mapsall values within the memory cell to be between -1 and 1. The final output ht of the current unitis hence a scaled version of the current memory cell Ct.

ht = og ◦ tanh(Ct)

where the dimension of ht is m× 1.

Figure 16: The input xt and previous output ht−1 are passed into the output gate og. The outputof the forget gate is then multiplied entry-wise with the scaled current memory cell Ct. The resultis the final output ht.

14

4.6 Training the network

4.6.1 Loss function

The network has to be trained to obtain the right weights and biases to be used in each functionand gate that exist within the network. The first part of training the network except collectingtraining data is choosing a loss function. The loss function is the way in which the network istested to be functioning correctly or not. It is a measure of the error. In binary classification acommon loss function is the cross entropy loss function

L(x, y) = −y log(H(x))− (1− y) log(1−H(x)), H(x) ∈ (0, 1), y ∈ {0, 1}

where y is the desired output and H(x) is the output of the network given the input x.

While training, the network is provided input data and the correct classification of the inputdata. The loss function takes the output, or rather the predicted classification, of the network andcompares it to correct classification. The loss function will then reward the network if the predictedclassification is correct and punish the network if it is incorrect. The weights and biases in thenetwork are updated depending on how correct the predicted classifications are by using backprop-agation. One important aspect to note about loss functions in recurrent neural networks is theirnon-convexity [11]. There is therefore a risk to get stuck at a local minimum while minimizing forthe loss/error.

4.6.2 Backpropagation

The main idea behind backpropagation is derivation using the chain rule [12]. The reason for usingderivatives is to find out how each weight Wk and each bias bk in the network affects the lossfunction. These derivatives can be expressed as

∂L

∂Wk=∂L

∂H

∂H

∂Wk,

∂L

∂bk=∂L

∂H

∂H

∂bk

Backpropagation works by starting at the end of the network where the derivative is straightforward to compute and then following along the network backwards to calculate how each part ofthe network is affecting the loss function.

4.6.3 Gradient stochastic descent

After using backpropagation it is known how all the current weights and biases affect the loss func-tion. The next step is to update the weights and biases so that the network can perform betterand to minimize the loss. This is done with gradient stochastic descent by updating the weightsand biases using the derivative of the loss with respect to the weight or bias. This is achieved withthe equations

Wk :=Wk − η∂L

∂Wk, bk := bk − η

∂L

∂bk

The loss should be as small as possible and by subtracting the derivatives ∂L/∂Wk and ∂L/∂bkfor each weight and bias multiplied by a small value η the weight and bias will approach minimumloss. Due to the non-convexity of the loss function it is not guaranteed that the gradient descentreaches the global minimum.

4.6.4 Adam

Adam is a variant of gradient stochastic descent that is commonly used in training neural networks.The advantage of Adam over regular gradient stochastic descent is its faster convergence.In contrast to stochastic gradient descent, Adam uses momentum and an adaptive learning rateto make the convergence faster. The parameters used for stochastic optimization in this thesisproject are the same as the default settings provided in the original paper on Adam published in2014 [13].

15

4.6.5 Batch and epochs

The amount of data in a batch decides how many samples of the training data will be used atonce to train the network. One difference between training by using a small or a large batch is theamount of memory that is used by the computer while training. Another difference is that withlarger batches the gradients (or derivatives) become more accurate [14]. If the batch is large thetraining will be faster and the computer will use a larger amount of memory. In the same way asmall batch will require a smaller amount of memory and the training will be slower. An epochon the other hand is when all of the training data has been used once to train the network. Thenetwork can be then be trained again using the same training data and improve the performanceby updating and improving upon previously obtained weights and biases. The downside to trainingthe data over too many epochs is that overfitting is likely to occur.

4.7 Stacked networksA stacked network is simply a LSTM network stacked onto another LSTM network. The outputsfrom the first network will be the input of the one stacked onto it. This way it is possible to findeven more complex dependencies between the inputs.

16

5 Click train extraction and processing

5.1 Extracting click trainsThe click trains are initially extracted by grouping clicks that have an inter-click interval at max-imum 149 ms apart from each other. The click groups that contain less than three clicks are notconsidered for classification and are therefore filtered out. If an extracted click train contains lessthan three clicks it is not considered for classification since it is too short for the classification tobe reliable. It is unlikely that the click train originates from a harbour porpoise if it is very short.The average number of clicks in a positively classified harbour porpoise click train is 15-16 clicks.The 149 ms figure is selected based on research on harbour porpoises [15] and chosen for its goodseparation of click trains with at least four clicks as described in Section 8.1.

5.2 Preprocessing

5.2.1 Reshape for input

In order to use an LSTM recurrent neural network with a fixed amount of units, each click trainis first processed to fit into it by either zero-padding or splitting. If a click train exceeds 20 clicksit is split into several parts. For example, if a click train contains 46 clicks it will be split intothree parts. Two parts containing 20 clicks and one part containing 6 clicks. The last part, or theremaining "tail" of the click train, is then zero-padded with 14 zeros in order to make it fit into thenetwork. If the remaining "tail" is less than 4 clicks long it will be removed due to classificationdifficulties of very short click trains. It is important to differentiate between click trains and clicktrain parts since each click train part within a click train will be classified independently of eachother.

5.2.2 Features

There are four features that are obtained for each click by AQUAclick 100. These are the clickduration, the inter-click interval, the amplitude output from a 60kHz filter and the amplitudeoutput from a 130 kHz filter. From these key click features the remaining click train features arecalculated as discussed in Section 2.3. These features are calculated separately for each click trainpart with a size of maximum 20 clicks. This is done because each click train part is independentlyclassified. The independent classification prevents the algorithm from incorrectly classifying partsof unusually long click trains. This is because long click trains may contain non-porpoise clicktrain parts. There are five click train features calculated from the key click features. These are thedeviation from the median of the inter-click interval and the click duration, the inter-click differenceof the inter-click interval and the click duration, and the time. By including the amplitude ratio,there are 10 features in total that are used to describe each click in a given click train.

5.2.3 Truncation and normalization

To avoid overfitting due to outliers in the dataset some feature values are truncated before trainingand classification. The features that are truncated are the inter-click interval, the click durationand the amplitude ratio. The inter-click interval is especially important to truncate due to therandom properties of the first click in an extracted click train. This first click can take on anyinter-click interval. This is because the most recent click prior to the first click in an extractedclick train can have occurred at any point in time. All inter-click intervals are therefore truncatedto 149 ms which indicates the start of a click train. As for the click duration and amplituderatio, values above a certain threshold do not provide any additional information about the click.Instead they only add outliers to the data that can cause overfitting. In the case of click durationthe threshold is set to 600 µs since it is uncommon for harbour porpoises to produce clicks thatlast longer durations. As for the amplitude ratio, the threshold value is set to 3. This means thatif the 130 kHz filter output amplitude is more than three times larger than that of the 60 kHz filteroutput amplitude, the amplitude ratio does not provide any additional clues about the correctclassification. All amplitude ratio values above 3 are therefore set to 3.It is a common practice in machine learning to normalize the data within a dataset to make theconvergence of the algorithm faster. This is also done in this thesis project for optimizing the

17

LSTM network. The normalization is simply done by normalizing all values for each feature F inthe dataset to lie in the interval F ∈ (0, 1).

5.3 Reverb detectionThe recordings by AQUAclick 100 often contain unwanted reverbs. The reverbs are caused by theclick sounds bouncing off nearby objects or the sea floor. The reverbs are not real clicks and itis therefore beneficial to detect them and to remove them. In this thesis project the reverbs arenot removed before classification. Instead the training is made with training data that includesall reverbs. To remove reverbs before classification could cause the processing stage to modifynon-porpoise click trains to resemble porpoise click trains. This could potentially cause incorrectclassifications. The number of reverbs is often an additional indication of the correct classification.Reverb detection is still done in this thesis project for removing unwanted reverbs after classifica-tion.The inter-click interval of a reverb click is in almost all cases an outlier compared to all otherinter-click intervals within the same click train part n. The deviation from the median T

(n)median

of the inter-click interval T (m,n)ici of a given click m is for a reverb click characteristically large as

explained in Section 2.3.2. Since every median value for the inter-click interval is different thedeviation from the median is normalized by dividing by the median. This normalized deviationfrom the median for the given inter-click interval is checked against a threshold Tthreshold. If thefinal value is larger than the threshold value then the click is classified as a reverb.

T(m,n)ici − T (n)

median

T(n)median

≥ Tthreshold

18

6 Click train classification

6.1 Manual classificationThe manual classification was an iterative process that was helped by the automated classification.The first step of the manual classification was done by first implementing a click train extractionalgorithm in MATLAB as described in Section 8.1. The analyzer was presented with all theextracted click trains that had been extracted in sequence and was then asked by the program toclassify the presented click trains one by one. A first database of correct classifications was thuslycreated from the analyzer’s input. During the first step of the manual classification process around1500 click trains were classified. This first database of classified click trains was then fed into theLSTM recurrent neural network and the LSTM network was trained for a first time as describedin Section 6.2.4. New input data was then fed into the trained LSTM network and classified bythe network. The classifications produced by the network were then scrutinized and correctedmanually until the final dataset was created.There are two reasons for this iterative approach. One reason is that the classifications of thenetwork were good enough already at an early stage so that it was less time consuming to correctthe network classifications than classifying the extracted click trains from scratch. The secondreason was that the iterative process helped out when training the LSTM network by fixing itsmistakes and thus improving its performance for each iteration.

6.2 Automated classification

6.2.1 Dataset

The dataset consists of 3114 manually classified click train parts collected from Strib, Denmark.This dataset was created iteratively as described in Section 6.1.

6.2.2 LSTM network structure

The final network structure is shown below in Fig. 17.

Figure 17: The final stacked LSTM network structure. It consists of three layers. The first andsecond layers are LSTM networks with 20 units each. For the first layer each unit represents aclick in a click train. The third and last layer is a sigmoid function.

The input to the final network is called a click train part and it is a 10× 20 matrix of values. TheLSTM network is fed by all click train parts in the data to be classified in an array of matricescalled a tensor. There are therefore two important input dimensions data dimension × time step.In the first layer the data dimension describes how many features each click have and the timestep describes how many clicks there are in a click train. The structure of the LSTM recurrentneural network applied in this thesis project is a stacked network. The input dimension is 10× 20

19

for the first network and 22× 20 for the second network. The output of the first network is hencea sequence that is fed into the second network. The output dimension of the second network is11 × 1 and is in turn fed into a sigmoid function that gives the final 1-dimensional classificationoutput of a click train. The final network structure was selected on a trial-and-error basis thatyielded the best classification results.

6.2.3 Network layers

The first layer is an LSTM network and each unit in the first layer receives an input that representone click. The input contains all 10 features for one click as described in Section 2.2 and 5.2.2.The output of each unit in the first layer is a vector of length 22. The same output vector is alsopassed on to the next unit as can be seen in Fig. 18. For the second LSTM layer each unit receivesthe output from the first layer which is a vector of length 22. The output from the second layer isa vector of length 11. It should be noted that there are actually two outputs that are passed on tothe next unit in the first and second layer. These are the unit output and the memory cell. Thedimension for both of these outputs are the same as explained in Section 4.3. The third layer isa sigmoid function σlayer containing a weight vector wσ of size 1× 11 and a bias bσ of size 1× 1.This layer forces the final output to be of size 1× 1 and to be a likelihood value between 0 and 1.

σlayer =1

1 + e−(wσ·xt+bσ)

where xt is the input to the third layer and has the dimension 11× 1.

Figure 18: Input and output dimensions of the three layers. Illustrated in the figure from left toright are; two LSTM units from the first layer, two LSTM units from the second layer and thesigmoid function from the third layer.

6.2.4 Training

The training is done by feeding the training data and its correct classifications into the LSTMnetwork. For each click train part the predicted classification made by the LSTM network iscompared to the correct classification by the means of the loss function. By using backpropagationthe derivate of the loss function with respect to each weight and bias is calculated as describedin Section 4.6.2. The weights and biases are updated to minimize the output value of the lossfunction by the means of these derivatives and stochastic gradient descent. In this thesis projecta variant of stochastic gradient descent Adam is used as described in Section 4.6.4. Once this isdone for all click trains in the dataset the training has finished one epoch. The number of epochsthat is made to train the network is decided manually in advance. Too many epochs will resultin overfitting and too few will result in underfitting as explained in Section 3.4. For each epochthe accuracy and loss (the output value of the loss function) is evaluated. This evaluation foreach epoch of accuracy and loss is done for both the training data and the validation data. Thedifference between training data and validation data is further explained in Section 3.3.

20

6.2.5 Evaluation

Each time the network is trained one additional epoch the weights and biases are updated tominimize the loss of the training data. The way to evaluate and select the most appropriate weightsand biases for the LSTM network is to observe when the LSTM classification is performing well forboth the training data and the validation data. The performance at this stage is measured by howlow the loss is and how high the accuracy is. The training should be stopped at the epoch for whichthese two parameters are optimized. The evaluation and training is thus an iterative supervisedprocess where the network structure and number of epochs are selected based on the performance.The quality and quantity of the training, validation and test data also play an important role.Correcting the output of the network as described in Section 6.1 is another important aspect.

21

7 Implementation setup

7.1 AQUAClickViewThe AQUAclick 100 stores the logged data to .bin files. Since the encoding of these data files wereunknown during this thesis project AQUAClickView had to be used for decoding. The .bin files areloaded into AQUAClickView that consequentially allows the user to make an export of the loadeddata to readable .csv-files. These .csv-files that can then be processed for click train extractionand classification by the algorithm developed in this thesis project. The ideal would have been todirectly input the .bin file into the developed algorithm, but the process of exporting the .bin-fileto a .csv-file through AQUAClickView is straight-forward and easily done.

7.2 MATLABMATLAB is a computer program that allows for advanced mathematical calculations and datavisualization. The most useful feature is how MATLAB is able to store data in vectors andmatrices. It is essentially a simple programming language based on linear algebra with powerfulinbuilt processing tools. The advantage of MATLAB is the easy to use interface and the processingtools. The disadvantage of MATLAB is the lack of machine learning tools and inefficient processingof large datasets. This thesis project was first started to be implemented in MATLAB but thedevelopment was later moved to Python due to these limiting factors.

7.3 Python and Python libraries

7.3.1 Python

This classification algorithm was implemented in Python for several reasons. Primarily because ofthe wide range of open-source libraries available for Python. Python is also considerably faster andmore versatile compared to, for example, MATLAB. Python additionally allows for large datasetsto be processed fast.

7.3.2 Keras and Theano

Keras is an open-source machine learning library developed for Python. It is the LSTM recurrentneural network implemented in Keras that has been adopted in this thesis project to classifyharbour porpoise click trains. Keras is developed on top of two other machine learning open-source libraries by the names of "TensorFlow" and "Theano". In this thesis project the "Theano"backend is used due to the ease of implementation in Windows.

7.3.3 Numpy

Numpy is an open-source library for Python that allows for mathematical operations and vector-ization of data. It makes it possible to deal with data in a similar way to MATLAB. The input datato the machine learning algorithm used in this thesis project is a tensor and the Numpy librarymakes it possible to arrange the data in this way.

7.3.4 Pandas

Pandas is an open-source library for Python that enables for reading data files. It is used in thisthesis project in order read the data files in .csv format that the program AQUAClickView outputs.These .csv files contain all the filtered recordings made by AQUAclick 100. From these files clicktrains are extracted and classified.

7.3.5 Matplotlib

Matplotlib is an open-source library for Python that enables plotting figures. Matplotlib is used inthis thesis project for plotting click trains with corresponding features and predictions, and generalresults.

22

7.3.6 Tkinter

Tkinter is another open-source library for Python that allows for constructing a graphical userinterface (GUI). The Tkinter library was used to create a GUI that allows for easy employment ofthe porpoise classification tool developed in this thesis project.

23

8 Implementation

8.1 MATLAB

8.1.1 Extraction

The first part was implementing a click train extractor in MATLAB. Here the choice of 149 msas a maximum inter-click interval was chosen based on studying how many click trains could beseparated and extracted from the input data set. It could be observed that for this particulardataset the maximum number of click trains containing at least 4 clicks could be obtained with the149ms inter-click interval. An inter-click interval of 149ms is not only the value that maximizes theclick train number for a particular dataset, it is also less common for harbour porpoises to produceclicks with a longer inter-click interval. Choosing the right inter-click interval is important forboth separating click trains correctly and to collect as many porpoise-like click trains as possible.The click trains need to be separated correctly in order to avoid merging two or more click trainstogether. Merged click trains can be troublesome in the case that one part of the extracted clicktrain is a harbour porpoise click train and another part of the same extracted click train is difficultto classify correctly. It can also be a problem when an extracted click train is very long whichalso raises difficulties for making a correct classification. Missing out on click trains that couldpotentially be harbour porpoise click trains is clearly not desirable since a click train that has notbeen extracted will never be classified.

8.1.2 Manual classification

After the function for extraction had been implemented, another function for manual classificationwas created. This function shows the user one extracted click train at the time and asks the userfor a binary classification input, 1 for harbour porpoise and 0 for non-porpoise. The click trains arepresented to the user by displaying the inter-click interval, the click duration, the 130 kHz and 60kHz output amplitudes and the amplitude ratio. Directly after the classification of one click train,the user is presented the next extracted click train for classification. This process was first donefor a smaller amount of click trains, but the final classified dataset contains around 1500 manuallyclassified click trains.

8.2 Python

8.2.1 Training

Implementation was switched to Python from MATLAB due to the lack of machine learninglibraries and unsatisfactory file reading capabilities in MATLAB. Machine learning libraries wereat the time of this thesis project limited in MATLAB and MATLAB was not able to process thelarge .csv files output by AQUAclickView. In Python these two tasks could be handled by Kerasand Panda. Keras allowed for implementing LSTM recurrent neural networks that could be trainedusing the manually classified data.

8.2.2 Graphical user interface

One goal of this thesis project was to develop an easy to use program for porpoise classification.The graphical user interface (or GUI for short) lets the user decide what AQUAclick data file toread and in what folder the output files should be saved. It also allows the user to decide whatdata files, plots and sound files that the program should produce and save. Another importantoption that must be selected by the user before classification is possible, is the classification level.All options are chosen by check boxes inside the GUI as shown in Fig. 19.

24

Figure 19: The GUI of the porpoise click train classifier program implemented in Python.

8.2.3 Classification levels

The classification level specifies the threshold for which the classification will be either negative orpositive. For each click train part the classification algorithm outputs a likelihood value between0 and 1. The closer the value is to one, the more certain the algorithm is that the particular clicktrain part is positive. The natural classification level would be a value above 0.5 since the algorithmestimates the likelihood to be more than 50 % for a positive classification. There are however someother factors that need to be taken into consideration before selecting a good classification level.These are precision and recall which are explained in detail in Section 9.1.1. The user can choosebetween three distinct levels in the GUI: "strict", "moderate" and "loose". The three levels arebased upon optimizing between precision for three different levels. The user can also choose to inputtheir own threshold between 0 and 1. This allows the user to experiment with the classificationalgorithm. The main intended usage for the manual threshold option is to enable the user to setthe threshold very high (e.g. to 0.995) to create a super strict classification algorithm.

25

8.2.4 Click train data

The user can decide to output three click train data files. These files are .csv files containingeither all extracted click trains without any likelihood data attached, all extracted click trains withlikelihood data attached and all positively classified click trains. It should be noted that for the filecontaining all extracted click trains with likelihood data, the likelihood data is repeated for everyclick in a certain click train part. This does not mean that every click has been independentlyclassified. The classification is always done independently for each click train part and not forevery click. There might in other words exist specific clicks within a given click train part thatare more or less likely to be a harbour porpoise click. This variation for each click within a givenclick train part is not evaluated and instead every click train part is assigned one likelihood thatreflects the overall likelihood for all clicks within the click train part. In addition the GUI alwaysoutputs a text file containing the main classification information. It specifies the total number ofclick trains and click train parts extracted, the total number of positively classified click trains andclick train parts and the utilized classification level and threshold.

8.2.5 Click train plots

The GUI can output either general result plots, click train plots or both. The click train plotscan also be narrowed down to saving all extracted click trains or only positively classified clicktrains. For each click train two individual plots are produced for every extracted or positivelyclassified click train. The first of the plots produced for every click train displays the start time ofrecording for the selected AQUAclick data file, the time at which the current click train occurredsince the start time, the inter-click interval for every click within the click train, the click durationfor every click within the click train, the amplitude ratio for every click within the click train andthe prediction/likelihood for every click train part of the click train as shown in Fig. 20. Thismeans that depending on how many clicks the current click train contains there are fewer or morelikelihoods displayed. If a given click train consists of more than one click train part, the beginningof every new click train part is indicated with a black vertical dotted line in the inter-click interval,click duration and amplitude ratio subplots within the plot. It is also important to note the threehorizontal dotted lines in the prediction subplot. These lines represent the three classification levelsthat can be selected in the GUI. The green line represent the "strict" classification level threshold,the black line represents the "moderate" classification level threshold and the red one representsthe "loose" classification level threshold. Additionally, all clicks that have been classified as reverbsare colored green in both the first and second click train plot.

Figure 20: The first plot of a porpoise click train generated by the click train classifier programimplemented in Python. Recorded in Middelfart, May 2017.

26

The second click train plot is similar to the first plot. In the second plot, as shown in Fig. 21,the amplitude output of the 130 kHz filter and the amplitude output of the 60 kHz filter for everyclick within the current click train is displayed instead of the click duration and amplitude ratio inthe first plot. The motivation for plotting a large portion of the data twice is to enable the user toeasily differentiate which two click train plots are displaying different features from the same clicktrain. The inter-click interval is commonly the most characteristic feature for a given click trainand it is therefore displayed in both separate plots. The prediction is displayed in both plots aswell since it allows the user to see the relationship between the likelihood estimations and all theclick train features. Other click train features such as the inter-click difference and the deviationfrom the median are not displayed in the click train plots since they are usually more difficult tointerpret.

Figure 21: The second plot of a porpoise click train generated by the click train classifier programimplemented in Python. Recorded in Middelfart, May 2017.

The general result plots are two plots that displays the number of positively classified clicks con-tained in the whole input AQUAclick data file. Thus these plots focus on the number of positivelyclassified clicks contained within the positively classified click train parts. Positively classified clicksthat also have been classified as reverbs are removed and not considered as positively classifiedclicks in the general result plots. The first of these plots shows the total number of positivelyclassified clicks for each hour over the whole recording time as displayed in Fig. 22. All positiveclicks that occurred within the same hour are counted for every hour of recording and a bar graphis produced from the result.

In the second plot each minute that contains a positive click within each hour is counted andit is called a positive minute as displayed in Fig. 23. If every minute in a given hour is a positiveminute then that hour would have 60 positive minutes which is the maximum number of positiveminutes an hour could have. Counting positive minutes can therefore be viewed as a normalizationof the first plot since it is independent of the total number of positive clicks and focuses on theclick density over time. The second plot is also a bar graph and displays the number of positiveminutes per hour.

27

Figure 22: Porpoise click trains per hour for data recorded in Middelfart, May 2017.

Figure 23: Porpoise positive minutes per hour for data recorded in Middelfart, May 2017.

8.2.6 Click train sounds

The human ear is good at pattern recognition and it is of interest to make the extracted clicktrains audible. A harbour porpoise produces clicks at 130 kHz which is beyond the common hear-ing range between 20 Hz - 20 kHz. The click sounds in the click train therefore need to be loweredin frequency in some way to become audible. In this thesis project four distinct methods of creatingaudible click train sounds are used and can be selected in the Sound output settings menu in theGUI. These are called adaptive click duration, adaptive frequency, adaptive cycles and insert realporpoise click. The general approach is to model a click number n by a sum of two sine functionsas follows:

f(n)click(t) = A

(n)130 sin(2πf130t) +A

(n)60 sin(2πf60t), 0 s ≤ t ≤ T (n)

cd s

where A(n)130 is the amplitude output of the 130 kHz filter, A(n)

60 is the amplitude output of the60 kHz filter, f130 is the frequency corresponding to 130 kHz, f60 is the frequency correspondingto 60 kHz and T (n)

cd is the click duration for the current click.

28

A function f(n)click(t) is produced for every n numbers of clicks in the click train. The clicks are

then spaced with the corresponding silence with the inter-click interval T (n)ici to create the complete

click train sound.

8.2.7 Adaptive click duration

For the adaptive click duration method the general approach is used although both the 130 kHzfrequency and the 60 kHz frequency recorded by the AQUAclick 100 are represented by movingboth frequencies down in octaves 6 times. This means dividing both frequencies with 26 = 64 toobtain each representative frequency in hearing range. The frequencies thus become:

f130 = (130× 103)/26 Hz ≈ 2.031× 103 Hz

f60 = (60× 103)/26 Hz ≈ 0.9375× 103 Hz

Lowering the frequencies results in longer time periods. To complete one cycle for a sine func-tion with frequency f60, the time period T60 = 1/f60 = 1.067× 10−3 s is needed. It is not possibleto keep the original click durations in this case since the frequencies would for many cases beinaudible. This problem is solved by adding the time T60 to all n number of click durations withina given click train. This approach has the obvious downside of not staying true to the click duration.

T(n)cd := T

(n)cd + T60 s

8.2.8 Adaptive frequency

The adaptive frequency method is also based on the general approach. In this case there is onlyone sine function whose frequency is varied depending on the click duration. The drawback of thisapproach is that some frequencies produced are beyond the hearing range. The click duration istherefore set to have a minimum duration of 75 µs. This means that the highest possible frequencywill be fmax = 1/(75×10−6) Hz ≈ 13.333 kHz. The output amplitude from the 130 kHz filter A(n)

130

is kept to retain some information about whether the sound is produced by a harbour porpoise ornot.

f(n)click(t) = A

(n)130 sin(2πf

(n)t) , 0 s ≤ t ≤ T (n)cd s

where

f (n) = 1/T(n)cd Hz, T

(n)cd :=

{75× 10−6 s for T (n)

cd < 75× 10−6 sT

(n)cd s for T (n)

cd ≥ 75× 10−6 s

8.2.9 Adaptive cycles

In the adaptive cycles method the number of cycles that a sine function with frequency 130 kHzmakes in a given click duration is retained. The number of cycles is then kept for the frequencyf130 in hearing range that is chosen to represent the 130 kHz frequency.

m(n)cycles = T

(n)cd × f130kHz = T

(n)cd × 130× 103

The new click duration T(n)cd for frequency f130 is then calculated and replaced for all n num-

ber of clicks in a given click train by:

T(n)cd :=

m(n)cycles

f130, f130 = (130× 103)/26 Hz ≈ 2.031× 103 Hz

8.2.10 Insert real porpoise click

In the last method Insert real porpoise click each click is not modeled by a sine function. Insteadthe click is replaced by a real harbour porpoise click that has been slowed down 64 times. Theamplitude of the click is also in this case regulated by A(n)

130 for every click n in a given click train.

29

9 Evaluation

9.1 Performance measures

9.1.1 Precision and recall

For binary classification there are four possible outcomes of classification; true positive, false pos-itive, true negative and false negative. A true positive is a correctly classified positive and a falsepositive is an incorrectly classified positive. In the same way a true negative is a correctly clas-sified negative and a false negative is an incorrectly classified negative. For all binary classifiersit is desirable to maximize the number of true classifications and minimize the number of falseclassifications. To evaluate the performance of a binary classifier the measures precision and recallare commonly used. The precision measures how precise the classifier is and answers the followingquestion: For each time the classifier makes a positive classification how often is it correct? Preci-sion can thus be defined as follows:

Precision =Number of true positives

Number of true positives + Number of false positives

Recall on the other hand measures how many of all the positives that it actually classifies aspositives and answers the question: How often does the classifier classify what should be positiveclassifications as positive? Recall is therefore defined as:

Recall =Number of true positives

Number of true positives + Number of false negatives

In the case of harbour porpoise classification it is important to make sure that the classifiedclick trains are actually correctly classified. This is the same as saying that the precision needsto be high. It is in the same way desirable to have a high recall rate as to not miss out on anyporpoise click trains. There is however a trade-off to be made between recall and precision. It ispossible to have perfect precision and almost zero recall and vice versa. A high recall thereforeoften (but not necessarily) implies a lower precision. Due to this relationship between precisionand recall it is often not feasible to design an ideal classifier and it is instead necessary to balancebetween the precision and the recall.

9.1.2 The Fβ score

The Fβ score is a measure that takes into consideration both precision and recall to make a bal-anced evaluation of the performance based on both of them. The Fβ score allows for valuing theprecision and recall differently. If a high precision is more important than a high recall the betavalue is changed. The general formula for the Fβ score is as follows:

Fβ = (1 + β2) · precision · recall(β2 · precision) + recall

There are three Fβ scores selected to evaluate the performance in this thesis project and thoseare the F0.25, F0.5 and F1 scores. The F0.25 score places 4 times as much importance on precisionthan recall. The F0.5 score places 2 times as much importance on precision than recall, while theF1 score places equal importance on precision and recall.

9.1.3 Accuracy

It is also possible to evaluate the overall performance of a binary classifier with the accuracymeasure. This measure is used in this thesis project when evaluating the performance of theclassifier at the training stage. The accuracy of the classifier is compared for the training dataand the validation data for each epoch during training of a neural network. The training is thenstopped at the point when the accuracy of the validation data starts to diverge from the accuracyof the training data. When the diversion occurs the classifier has become overfit as described inSection 3.4.

30

The accuracy measure is defined as follows:

Accuracy =True positives + True negatives

True positives + True negatives + False positives + False negatives

The accuracy measure is not perfect and needs to be complemented with other measures. Themain problem with the accuracy measure is the so called accuracy paradox. The accuracy paradoxstates that if a binary classifier always makes negative predictions it can result in high accuracy.Such a classifier does not have any predictive power but is still considered to perform well accordingto the accuracy measure.

9.1.4 Loss

The loss is the output of the loss function for a given machine learning algorithm as described inSection 4.6.1. A machine learning algorithm is always optimized by minimizing the loss value. Theloss measure is however not used when evaluating the performance of the final classifier. Instead itis used during training in the same way as the accuracy. The loss of the classifier is thus comparedfor the training data and validation data of each epoch during training of a neural network. Thetraining is then stopped at the point when the loss of the validation data starts to diverge fromthe loss of the training data.

9.2 The classification thresholdThe final network in this thesis project outputs a likelihood value between 0 and 1. This means thatwe could define an infinite number of classifiers based on the network by simply selecting a differentthreshold value for what constitutes a positive classification. The obvious classification thresholdvalue would be 0.5 since it states that when the network is 50% sure that the input data should beclassified as positive it will be classified as positive. Choosing a different classification thresholdhowever results in a different number of true positives and false positives. Consequentially theprecision and recall are also different for every choice of classification threshold. The classificationthreshold should therefore be selected to optimize for the desired function of the classifier.

9.2.1 ROC curve

The ROC (receiver operating characteristic) curve shows how the amount of true positives andfalse positives changes as a function of a given feature threshold. The ROC curve in this thesisproject depends on the value of the classification threshold. As each value of the classificationthreshold results in a different amount of true positives and false positives it is beneficial to studythe changes of these to be able to select the optimal classification threshold. The ROC curve alsogives insight to the general performance of the network’s likelihood prediction. The ROC curve ofan ideal classifier is a right angle where the area under the curve (AUC) is 1. The worst possibleclassifier is on the other hand a straight 45◦ line with an AUC of 0.5.

9.2.2 Precision-recall curve

The precision-recall curve shows how the prediction and the recall changes as a function of a givenfeature threshold. In this thesis project the precision-recall curve is plotted as a function of theclassification threshold. The usage of the precision-recall curve is similar to that of the ROC curve.It is used to evaluate the performance of the classifier and to find optimal feature thresholds. Theprecision-recall curve for an ideal classifier is a right angle with the AUC of 1 and the worst possibleclassifier a straight 45◦ line with an AUC of 0.5.

31

9.2.3 Fβ optimization

Corresponding Fβ scores can be calculated for each point on a precision-recall curve. It is thenpossible to obtain the Fβ scores as a function of a given feature threshold. The feature thresholdthat results in the highest Fβ score is the optimal threshold given the desired function of theclassifier. The feature threshold used in this thesis project is the classification threshold and theFβ optimization is done for the three selected Fβ scores F0.25, F0.5 and F1 as explained in Section9.1.2. This implies that the classifier is optimized for different levels of precision. Note also thatthe classification levels "strict", "moderate" and "loose" as described in Section 8.2.3 correspondsto the classification thresholds optimized for the F0.25, F0.5 and F1 scores in that order.

32

10 Results

10.1 Final networkThe final network is a stacked network with three layers, two LSTM layers and one sigmoid layer asshown in Fig. 17 in Section 6.2.2. The first LSTM layer in the network consists of 20 units. Eachunit in the first layer receives 10 values which represent the 10 click train features as explained inSection 5.2.2. The size of the memory cell in the first layer is 22× 1, which means that the outputfrom each unit of the first layer is 22×1. The second LSTM layer consists of 20 units and each unitreceives the 22 output values of the corresponding units in the first layer. The second layer has amemory cell of size 11 × 1. The 11 outputs of the last unit in the second layer is then multipliedwith 11 corresponding weights and then passed through a sigmoid function. The sigmoid functionforces the final output to be between 0 and 1.

10.2 TrainingThe final model was trained and the number of epochs was selected by studying Fig. 24 and Fig.25. The training data was 60 % of the dataset or 1669 classified click trains and the validation datawas 20% of the dataset or 823 classified click trains as described in Section 3.3. The model accuracyshould be as high as possible and the model loss should be as low as possible. The validation dataand the training data should have roughly equal values for both the model accuracy and the modelloss. This indicates that the model (network) has generalized. It can be seen in Fig. 24 and Fig.25 that the LSTM network becomes overfit after approximately 50 epochs. This means that thebest point to stop the training was around 50 epochs. The final model was selected at 42 epochsdue to the best overall performance for training, validation and test data. The test data was 20 %of the dataset and was only used to evaluate the model after training.

Figure 24: This figure displays the accuracy of the algorithm for training and validation datafor the first 180 epochs. It is clear that the algorithm becomes overfit to the training data afterapproximately 50 epochs.

33

Figure 25: This figure displays the model loss of the algorithm for training and validation datafor the first 180 epoch. It is also clear that the algorithm becomes overfit to the training data afterapproximately 50 epochs.

10.3 OptimizingWhen the final network has been selected there are still an infinite number of classification modelsto obtain due to the classification threshold as described in Section 9.2. In order to select theoptimal classification threshold the precision and recall were calculated for the test data for 2340possible classification thresholds ranging from 0 and 1. In turn the F0.25, F0.5 and F1 scores werecalculated for each classification threshold as described in Section 9.2.3. Three distinct classifica-tion thresholds were selected that optimized the value of the three Fβ scores for the test data. Forthis classifier it is more important that a porpoise click train is correctly classified than that it ismissed, i.e. precision is more important than recall. As explained in Section 9.1.2, this is why theFβ scores are used.

The overall performance of the final network can also be studied in the precision-recall curveshown in Fig. 26 and the ROC curve shown in Fig. 27. It can be seen that both curves approachthe ideal point of perfect precision and recall or prefect true positive rate and no false positivesas described in Section 9.2.1 and 9.2.2. The area under the curves (the AUCs) are both close to1 which represents the ideal AUC for both the ROC and the precision-recall curves. The AUC= 0.94 in both cases. The curves in both plots are additionally far from the straight line whichrepresents random classification as a function of the classification threshold. To illustrate thatthe final network model has generalized well, two distinct curves are shown in both the ROC andthe precision-recall curves. One of the curves is evaluated for both training and validation data,i.e. all data that was used specifically to train the network. The second curve is evaluated on theindependent test data and is therefore the more important of the two to study. Both curves behavesimilarly which is an indication of good generalization. The networks thus perform similarly forall types of data in the dataset.

34

Figure 26: This figure displays the precision and recall as a function of the classification threshold.The straight line represents random classification.

Figure 27: This figure displays the true positives and false positives as a function of the classifi-cation threshold. The straight line represent random classification.

35

In Fig. 28 it can be observed that the F0.25 and F0.5 scores increase with a higher classificationthreshold. This is natural since these two scores value precision more than recall. The greater theclassification threshold is, the more certain the classifier has to be about the correct classification.It is therefore less likely that a positive classification will be incorrect with a higher classificationthreshold. This leads to higher precision with an increased classification threshold. The F1 scoreon the other hand has its peak close to 0.4. This is due to a high recall at this point while stillmaintaining decent precision. The recall naturally increases when the precision decreases. All ofthis contributes to a high F1 score.

Figure 28: This figure displays the Fβ scores as a function of the classification threshold.

10.4 Confusion matricesThe precision, recall and Fβ scores are calculated to evaluate the performance of each selectedmodel based on the optimal classification thresholds. The results are presented in tables calledconfusion matrices. The "class 0" tag represents the performance of correctly classifying non-porpoise click trains while the "class 1" tag represents the performance of correctly classifyingporpoise click trains. It needs to be clarified that precision and recall refers to the "class 1" per-formance in this thesis unless the opposite is explicitly stated.

During the training of the network the "natural" classification threshold 0.5 was used as shownin Table 1. It is for this classification threshold that the network is designed to work best. Theaccuracy for each epoch of training in Fig. 24 was calculated with this threshold. It is importantto note that even though the network is trained and optimized for the classification threshold 0.5it does not fulfill the specified performance that prefers precision over recall. The confusion matrixis still useful to compare the performance obtained during training to the performance obtainedfrom Fβ optimization. Observe that Table 2, Table 3 and Table 4 display the corresponding con-fusion matrix performance of Table 1. The corresponding Fβ scores have been optimized for "class1" performance and not for "class 0". This is because of the specification of classifying harbourporpoise click trains with high precision.

In Table 2 the precision for "class 1" has been increased to 0.98 while the recall has been de-creased to 0.60 as compared to Table 1. This pattern can also be observed in Table 3 and Table4. In Table 5 the optimal classification thresholds are listed with their corresponding Fβ scores. Itshould be noted that the F0.25 score reaches the highest value due to the high precision in Table 2.

36

Table 1: Confusion matrix for the test data with the classification threshold 0.5.

Precision Recall F1-score Test data sizeClass 0 0.89 0.91 0.90 339Class 1 0.89 0.86 0.87 283Average / Total 0.89 0.89 0.89 622

Table 2: Confusion matrix for the test data with the classification threshold optimized for the F0.25

score for class 1.


Table 3: Confusion matrix for the test data with the classification threshold optimized for the F0.5

score for class 1.


Table 4: Confusion matrix for the test data with the classification threshold optimized for the F1

score for class 1.


Table 5: Fβ scores for optimal classification thresholds

Fβ Classification threshold Fβ scoreF0.25 0.8512709140777588 0.936123348018F0.5 0.5718547701835632 0.883725270623F1 0.4109937250614166 0.870588235294

37

11 Discussion

11.1 General resultsThe classifier developed in this thesis project functions with either a well-balanced precision andrecall or with high precision and acceptable recall as can be seen in Table 1 and 2. The performanceis still not perfect. As for the F0.25 optimized classification threshold in Table 2, the precision isexcellent at 0.98 while the recall is merely 0.60. This means that 40 % of all positively classifiedporpoise click trains are false negatives. This could however imply that the classifier is moreconsistent and allows for less variation than the analyzer that constructed the dataset. Since thequality of the dataset could be improved further this effect is not necessarily unfavorable. Theperformance could also be evaluated with more test data to assure that the estimated performanceis correct.

11.2 Dataset improvementsThe main drawback while training and evaluating the classifier is the dataset. The dataset usedin this thesis project has been manually constructed by the author who is not a trained zoologist.If the dataset instead had been constructed by a professional zoologist with expert knowledge inharbour porpoises, the quality of the dataset would surely have improved. The performance of theclassifier is highly dependent on the quality of the dataset and consequentially the performance ofthe classifier would have improved as well. The current dataset is also slightly skewed with 1661"class 0" non-porpoise click trains and 1453 "class 1" porpoise click trains. This implies a 53− 47ratio which is suboptimal. The effect of this can be seen in Table 1 that has a higher F1 score for"class 0" than "class 1". The classifier is better at correctly classifying non-porpoise click trainspartly due to this imbalance. The quality of the data could also be improved by collecting datafrom many different sources. The dataset used in this thesis project consists of data collected inStrib, Denmark in 2008. This means that the training, validation and test data were all collected atthe same time. The similarity of the test data and the training data might also affect and bias theevaluation of the performance. A more varied dataset would therefore be preferable. The quantityof the data could also be increased. This would potentially boost the performance additionally.

11.3 Network improvementsThe network is very complex and each parameter affects the performance of the final classifier. Theperformance of the network could potentially improve by changing any of these parameters. Oneidea for future improvement is to have a different length of the click train part which is currently20. The value was chosen based on the fact that the average click train length in the dataset isapproximately 16. Since zero-padding is an important tool to allow for shorter click trains, thelength 20 was chosen to train the network to ignore zeros in all cases. Changing the click trainpart length arbitrarily is currently not possible since the manual classification relies on this length.This means that if the click train part length is changed then the manual classification has to beredone. The click train parts could potentially be resized to a size with a common denominator of20. For example, the click train part could be resized to 5. This means that each click train part inthe current dataset could be divided into 4 parts of 5 clicks each. Simply quadrupling the previousclassifications of the click train part would allow for obtaining classifications without redoing themanual classification process. One disadvantage of this method is that it can be difficult to classifyshorter click train parts and thus the classification that was valid for the longer click train part isno longer valid for the new click train parts. The advantage of changing the click train part lengthwould be to allow for having several networks that each take different lengths of the click trainparts as inputs. Each network would be optimized to classify one certain length. An assemble ofnetworks could potentially also help classify all click trains by a voting method.

38

It could also be of interest to develop a sequence-to-sequence LSTM network model that is trainedwith individually classified clicks. Such a model would ideally have a sequence of classified clicks asoutput while still maintaining a classification algorithm based on click train characteristics. Oneadvantage of a model like this would be that it is not constrained to a predefined click train partlength as discussed above. The method developed in this thesis project will in some cases positivelyclassify a given click train part that contains a small number of non-porpoise clicks and reverbs.If each click would instead be classified individually, these non-porpoise clicks and reverbs couldbe filtered out automatically.

The final network obtained in this thesis project was partly created on a trial-and-error basis.There are an unlimited amount of stacked networks that could still be tested. It is also possiblethat another machine learning algorithm would suit this problem better. In the future GRUs,support vector machines, random forests or another appropriate machine learning algorithm couldbe implemented. The performance could then be compared to that of the stacked LSTM networkdeveloped in this thesis project.

39

References

[1] Nick Tregenza. Powerpoint: C-pod, how it works. http://www.chelonia.co.uk/downloads/CPOD%20How%20it%20works.ppt, 2011. Accessed: 2017-08-24.

[2] Kolmården wildlife park: Harbour porpoise. http://www.kolmarden.com/hallbar_varld/insamlingsstiftelse/bevarandeprojekt/tumlare. Accessed: 2017-08-06.

[3] J Carlström, C Rappe, and S Königson. Åtgärdsprogram för tumlare 2008–2013 (phocoenaphocoena), 2008.

[4] Nicolas Lefebvre. Aquaclick v2 system manual. 2008.

[5] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variancedilemma. Neural Networks, 4(1), 2008.

[6] Andrew Ng. Lecture: Nuts and bolts of building ai applications using deep learning, 2016.

[7] Yaser S Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data, vol-ume 4. AMLBook New York, NY, USA:, 2012.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,9(8):1735–1780, 1997.

[9] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual pre-diction with lstm. 1999.

[10] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neuralnetworks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.

[11] David Sussillo. Neural circuits as computational dynamical systems. Current opinion inneurobiology, 25:156–163, 2014.

[12] Alex Graves et al. Supervised sequence labelling with recurrent neural networks, volume 385.Springer, 2012.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980, 2014.

[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.http://www.deeplearningbook.org.

[15] Anne Villadsgaard, Magnus Wahlberg, and Jakob Tougaard. Echolocation signals of wildharbour porpoises, phocoena phocoena. Journal of Experimental Biology, 210(1):56–64, 2007.

40

http://www.chelonia.co.uk/downloads/CPOD%20How%20it%20works.ppt

http://www.chelonia.co.uk/downloads/CPOD%20How%20it%20works.ppt

http://www.kolmarden.com/hallbar_varld/insamlingsstiftelse/bevarandeprojekt/tumlare

http://www.kolmarden.com/hallbar_varld/insamlingsstiftelse/bevarandeprojekt/tumlare

http://www.deeplearningbook.org

TRITA TRITA-EE 2017:132

ISSN 1653-5146

www.kth.se

Documents

Harbour Porpoise Click Train Classification with …kth.diva-portal.org/smash/get/diva2:1146181/FULLTEXT01.pdfDEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM,