DCASE 2016 Workshop

CONVOLUTIONAL NEURAL NETWORKS FOR ACOUSTIC SCENE CLASSIFICATION

Michele Valenti¹ (valenti.michele.w@gmail.com), Aleksandr Diment², Giambattista Parascandolo², Stefano Squartini¹, Tuomas Virtanen²

¹ Università Politecnica delle Marche, Italy
² Tampere University of Technology, Finland

Outline

• Introduction

• Our system

• Training modes

• Results

• Challenge ranking

Introduction: What is “acoustic scene classification”?

Audio → scene label (for example: Forest path, Car, Home)

Our system: Overview

Audio → Feature extraction → Sequence splitting → CNN → Scores averaging → Label

Our system: Features

Raw audio → Log-mel spectrogram
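The feature extraction step can be illustrated with a minimal NumPy sketch of a log-mel spectrogram. The frame length, hop size, number of mel bands and the mel-scale formula used here are illustrative assumptions; the slides do not state them, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common "HTK-style" mel formula (an assumption, not taken from the slides).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(audio, sr, n_fft=1024, hop=512, n_mels=60, eps=1e-10):
    """Return a log-mel spectrogram of shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel_energy = mel_filterbank(n_mels, n_fft, sr) @ power.T  # (n_mels, n_frames)
    return np.log(mel_energy + eps)  # eps avoids log(0) in empty bands

if __name__ == "__main__":
    sr = 44100
    tone = np.sin(2 * np.pi * 1000.0 * np.arange(sr) / sr)  # 1 s test tone
    print(log_mel_spectrogram(tone, sr).shape)
```

In practice an audio library would usually be used for this step; the sketch only shows the chain of framing, power spectrum, mel filtering and logarithm.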

Our system: Sequence splitting

Log-mel spectrogram of a raw audio segment → split into sequences
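Sequence splitting can be sketched as cutting the spectrogram along the time axis into equal-length chunks. The sketch below assumes non-overlapping sequences and drops trailing frames that do not fill a whole sequence; both choices are assumptions, and `split_into_sequences` is a hypothetical name.

```python
import numpy as np

def split_into_sequences(spectrogram, frames_per_sequence):
    """Split a (n_bands, n_frames) spectrogram into non-overlapping sequences.

    Returns an array of shape (n_sequences, n_bands, frames_per_sequence).
    Trailing frames that do not fill a whole sequence are dropped (assumption).
    """
    n_bands, n_frames = spectrogram.shape
    n_sequences = n_frames // frames_per_sequence
    if n_sequences == 0:
        raise ValueError("spectrogram is shorter than one sequence")
    usable = spectrogram[:, : n_sequences * frames_per_sequence]
    # (n_bands, n_sequences, L) -> (n_sequences, n_bands, L)
    return usable.reshape(n_bands, n_sequences, frames_per_sequence).transpose(1, 0, 2)

if __name__ == "__main__":
    spec = np.arange(14).reshape(2, 7)      # toy "spectrogram": 2 bands, 7 frames
    print(split_into_sequences(spec, 3).shape)  # (2, 2, 3): two sequences of 3 frames
```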

Our system: Convolutional neural network

Sequences → CNN

Sequence → 128 feature maps (with batch normalization) → 128 subsampled feature maps → 256 new feature maps → Time shrinking → Flattening → Fully-connected softmax layer
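The layer chain on this slide can be followed as a small shape walk-through. Only the feature-map counts (128, 128, 256) and the number of classes (15, from the class accuracy table) come from the slides. The input size, "same"-padded convolutions, pooling sizes, and the reading of "time shrinking" as pooling over the whole remaining time axis are assumptions made for illustration.

```python
def conv_same(shape, n_kernels):
    """Convolution with 'same' padding (assumed): only the map count changes."""
    _, bands, frames = shape
    return (n_kernels, bands, frames)

def max_pool(shape, pool_bands, pool_frames):
    """Non-overlapping max-pooling: floor division on each axis."""
    maps, bands, frames = shape
    return (maps, bands // pool_bands, frames // pool_frames)

def trace_shapes(n_bands, n_frames, n_classes, pool1=(5, 5), pool2_bands=4):
    """Return the (maps, bands, frames) shape after each stage of the chain."""
    shapes = {}
    x = (1, n_bands, n_frames)
    shapes["sequence"] = x
    x = conv_same(x, 128)              # batch normalization keeps the shape
    shapes["feature maps"] = x
    x = max_pool(x, *pool1)
    shapes["subsampled feature maps"] = x
    x = conv_same(x, 256)
    shapes["new feature maps"] = x
    x = max_pool(x, pool2_bands, x[2])  # pool over the whole time axis (assumed)
    shapes["time shrinking"] = x
    shapes["flattening"] = (x[0] * x[1] * x[2],)
    shapes["softmax"] = (n_classes,)
    return shapes

if __name__ == "__main__":
    # Hypothetical input: 60 mel bands, 150 frames per sequence, 15 classes.
    for stage, shape in trace_shapes(60, 150, 15).items():
        print(f"{stage:25s} {shape}")
```

With these assumed sizes, the time axis collapses to a single frame before flattening, so the softmax layer sees a fixed-length vector for the chosen sequence length.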

Our system: Scores averaging

Class prediction scores of all sequences → average (Σ) → argmax → File’s class
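A minimal sketch of the scores averaging step: the per-sequence class scores of one file are averaged class by class, and the class with the highest mean score is taken as the file's label. The function name and the toy scores are hypothetical.

```python
import numpy as np

def classify_file(sequence_scores, class_names):
    """Average per-sequence class scores and pick the best class.

    sequence_scores: array-like of shape (n_sequences, n_classes), e.g. the
    softmax outputs of the CNN for every sequence of one audio file.
    """
    scores = np.asarray(sequence_scores, dtype=float)
    mean_scores = scores.mean(axis=0)              # class-wise average
    label = class_names[int(np.argmax(mean_scores))]
    return label, mean_scores

if __name__ == "__main__":
    names = ["car", "home", "forest path"]
    toy_scores = [[0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.6, 0.1]]   # made-up scores for three sequences
    print(classify_file(toy_scores, names)[0])  # home
```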

Training: Cross-validation setup

Fold 1 to Fold 4: each fold consists of a Training + validation part and a Test part.

Training: Non-full training

Fold n: the Training + validation part is split into a Training subset and a Validation subset.

[Plot: training and validation accuracies over epochs, with the convergence time marked]

Training: Full training

Fold n: the whole Training + validation part is used for Training.
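The two training modes can be sketched as a two-stage procedure: a non-full run that monitors validation accuracy to find a convergence time, then a full run on training plus validation data for that many epochs. Defining the convergence time as the epoch with the best validation accuracy is an assumption, and `make_model`, `train_one_epoch` and `evaluate` are hypothetical callables supplied by the caller.

```python
def convergence_epoch(validation_accuracies):
    """1-based epoch with the highest validation accuracy (first one on ties)."""
    best = max(validation_accuracies)
    return validation_accuracies.index(best) + 1

def two_stage_training(make_model, train_one_epoch, evaluate,
                       train_set, val_set, max_epochs):
    """Non-full training to find the convergence time, then full training.

    train_set and val_set are assumed to be lists so they can be concatenated.
    """
    # Stage 1: non-full training, validation data is only used for monitoring.
    model = make_model()
    val_acc = []
    for _ in range(max_epochs):
        train_one_epoch(model, train_set)
        val_acc.append(evaluate(model, val_set))
    n_epochs = convergence_epoch(val_acc)

    # Stage 2: full training from scratch on training + validation data.
    full_model = make_model()
    for _ in range(n_epochs):
        train_one_epoch(full_model, train_set + val_set)
    return full_model, n_epochs
```

Retraining from scratch in the second stage is one possible design; it lets the model see the validation data without using that data to decide when to stop.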

Results: Test data

Fold 1 to Fold 4: results are computed on the Test part of each fold.

Results: Sequence length

[Plot: Accuracy (%) versus Sequence length (s) at 0.5, 1.5, 3, 5, 10 and 30 s; accuracy axis from 65 to 80; two series: Non-full training and Full training]

Results: Class accuracies

Class               Accuracy (%)
Beach               75.6
Bus                 76.9
Café/Restaurant     74.4
Car                 91.0
City center         93.6
Forest path         96.2
Grocery store       88.5
Home                80.8
Library             66.6
Metro station       96.2
Office              97.4
Park                59.0
Residential area    73.1
Train               46.2
Tram                78.2

34.6% Residential area

29.5% Bus
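Per-class accuracies, and the share of one class's files that end up in another class, can both be read from a confusion matrix. The sketch below uses made-up counts for three hypothetical classes; it does not reproduce the figures on the slides, and the function names are illustrative.

```python
import numpy as np

def class_accuracies(confusion):
    """Per-class accuracy in percent.

    confusion[i, j] = number of files of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.diag(confusion) / confusion.sum(axis=1)

def top_confusion(confusion, class_index):
    """Return (index of the most confused class, share of files in percent)."""
    row = np.asarray(confusion, dtype=float)[class_index].copy()
    total = row.sum()
    row[class_index] = 0.0          # ignore correct predictions
    other = int(np.argmax(row))
    return other, 100.0 * row[other] / total

if __name__ == "__main__":
    toy = [[8, 2, 0],
           [1, 9, 0],
           [0, 5, 5]]               # made-up counts, 10 files per class
    print(class_accuracies(toy))    # [80. 90. 50.]
    print(top_confusion(toy, 2))    # (1, 50.0)
```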

Results: Other classifiers

System                     Sequence length (s)   Accuracy (%), non-full training   Accuracy (%), full training
Baseline GMM (MFCC)        -                     -                                 72.6
Two-layer CNN (MFCC)       5                     67.7                              72.6
Two-layer MLP (log-mel)    -                     66.6                              69.3
One-layer CNN (log-mel)    3                     70.3                              74.8
Two-layer CNN (log-mel)    3                     75.9                              79.0

Challenge ranking: Final training

Extended training set: Training + validation + test
Evaluation set: Secret challenge data

• The extended training set is split into New training and New validation: convergence at 400 epochs
• Final training on the extended training set for 400 epochs

Challenge ranking

[Bar chart: challenge results, vertical axis from 0 to 100. Values shown: 89.7, 88.7, 87.7, 87.2, 86.4, 86.4, 86.2, 85.9, 85.6, 85.4, 84.6, 84.1, 77.2, 62.8]

Results: Feature comparison

System                     Sequence length (s)   Accuracy (%), non-full training   Accuracy (%), full training
Two-layer CNN (MFCC)       5                     67.7                              72.6
Two-layer CNN (log-mel)    5                     74.1                              78.3
