Speech/Music Discriminator
I wish to express my sincere gratitude to those who have helped me make this six-month work possible.

A special thanks to Mr. Ioan TABUS, head of the international programs, for having welcomed me in the Signal Processing Laboratory. Thanks also for his advice and nice cooperation in my work.

I am very grateful to Anssi KLAPURI, my thesis advisor, for his guidance and continuous support, and also for his patience and his good mood. I thank him also for his useful comments on this thesis.

Thanks to all the members of the Audio Research Group and the Signal Processing Laboratory for providing such a good atmosphere, and to all the people I met in Tampere University and with whom I spent a nice time.

Thanks also to Mr. REFREGIER and to Mr. KAKOURIDIS who permitted this exchange program between my school and the Signal Processing Laboratory.
TAMPERE UNIVERSITY OF TECHNOLOGY
http://www.tut.fi

Tampere University of Technology provides high-level education in technical science and architecture, engages in research and product development, and furthers technical and scientific development in Finland.

In addition, TUT serves public institutions as well as industry by means of shared projects, commissioned research and diploma projects.

TUT offers three one-year International University Programs, whose language of instruction is English. The programs are in the following fields: Biomedical Engineering, Digital Signal Processing, and Material Science. English is due to be introduced in Architecture, Environmental Engineering and Mechanical Engineering as the language of tuition. Approximately 200 international students attend Tampere University of Technology.

TUT offers programs covering a wide spectrum of technological knowledge and skills. The several departments are:
· Department of Architecture
· Department of Mechanical Engineering
· Department of Electrical Engineering
· Department of Civil Engineering
· Department of Material Science
· Department of Information Technology
· Automation Department
· Department of Industrial Engineering and Management
· Environmental Department
DEPARTMENT OF INFORMATION TECHNOLOGY
http://www.cs.tut.fi

· General Information
Data processing - the combination of computer and information technology - has made information technology into a discrete whole, which is required as a separate area of expertise in technology and as a provider of indispensable basic skills for other areas in technology. This development is reflected by the differentiation in 1993 of the Faculty of Information Technology from the Faculty of Electrical Engineering.

· Mathematics
In addition to teaching, the Institute of Mathematics is actively engaged in research on coding theory, hypermedia, mathematical control theory, fuzzy mathematics, optimization, computability theory, and other areas of mathematics.
Computer-aided hypermedia based teaching methods are being developed at the institute. This hypermedia teaching can be reached at a variety of levels and facilitates studying significantly.

· Signal Processing
The Signal Processing Laboratory is the leading research unit in Finland in the field of signal and image processing, offering teaching of the highest order in many subareas of information technology.
The most central research areas are signal processing algorithms and applications, digital image-processing methods, speech coding and processing, digital audio, digital processing circuit design, and VLSI implementation. The institute is participating in the EU research projects RACE, COST, EUREKA, and ESPRIT.

· Software Systems
The importance of software systems is based on the central role of software engineering in the applications of information technology and in product development. Central areas of strength in teaching and research are modern software methods and tools, design of embedded and distributed
systems, formal methods, and computer networks. Activities, which have been developed to serve domestic industry, have achieved international recognition.

· Telecommunications
Activities focus on the central area in telecommunications of the 1990's: broadband telecommunication networks, digital transmission techniques, and the new digital mobile communication systems. Together with several other institutes of the University, an extensive research project on broadband network technology has been set up. This FASTER project includes the construction of an ATM test network, research, and industrial projects in the field of ATM engineering.
THE SIGNAL PROCESSING LABORATORY
http://sigwww.dmi.tut.fi

· Introduction
The Signal Processing Laboratory at Tampere University of Technology was inaugurated in 1988. The Signal Processing Laboratory belongs to the Department of Information Technology, consisting of five laboratories.
The laboratory can be roughly divided into Digital Signal Processing and Computer Systems parts. The most important change in 1996 was the modernization of the curriculum of both parts to better reflect the current trends in research and industry. In Signal Processing both research and teaching can be divided into the following four subtopics: digital filtering, image and video signal processing, audio and speech, and multimedia. Those majoring in signal processing now select one (or more) of these topics. The coverage of all important areas of signal processing has been made possible by founding two posts of Associate Professors. After founding a new professorship in Computer Systems from the beginning of 1996, both the research and teaching in this area have been expanded and modernized. This trend has been strengthened by founding a part-time professorship and an associate professorship.
Numerous researchers are working in both applied and basic digital signal processing research projects, which include multimedia, digital video, digital audio, biomedical signal processing, signal processing for telecommunications and VLSI system design. The laboratory co-operates closely with industry and international research groups. The laboratory personnel comprises about 120 staff members.
Lectures in digital signal processing and basic computer systems are given to all university students. After their basic studies, students specialize in either or both Digital Signal Processing or
Computer Engineering. Several postgraduate level courses are also given in the above subjects. In 1989, an International Program in Digital Signal Processing was started. Courses in this program are given in English. Some 20 foreign and Finnish students are selected to the one-year program each year. The program has been a great success, and it was extended to a master's degree program in 1992.
In 1996, the laboratory personnel participated in a large number of international activities including the organization of international conferences. The number of international journal and conference papers presented was 78 in total. During 1996, international cooperation also developed positively, with participation now in all major European programs: ESPRIT, COST, RACE, COMETT, EUREKA, ERASMUS, and TEMPUS. In early 1997, Tampere International Center for Signal Processing was founded in order to enhance the international visibility and contacts of the laboratory.

· Educational Activities
The Signal Processing Laboratory is a department of Tampere University of Technology. Consequently, an important part of its activities is in teaching M.Sc. and Ph.D. level students.
The Laboratory has two major teaching areas. One is, naturally, signal processing, and the other is digital systems, including digital circuit design, computer architectures and integrated circuits.
One speciality of the Laboratory is an International University Program in Digital Signal Processing. This is a one-year course package in the field of signal processing offered to international M.Sc. and Ph.D. level students. The teaching language is English. Students holding a B.Sc. degree or equivalent in the area of electrical engineering have the possibility to continue their studies in the Laboratory up to an M.Sc. and further to Dr.Tech. degrees.
The international program, supporting student and teacher mobility, offers important possibilities for students coming to Finland or Finnish students going abroad. The Signal Processing Laboratory is actively participating, for instance, in the European Community TEMPUS and ERASMUS programs.
· Research Activities
The main research areas are the following:

Speech and Audio
· Enhancement of Speech in Harmonic Noise
· Collection of a Speech Database
· Audio Coding
· Angle of Arrival Estimation Using Small Sensor Arrays
· Audio Content Analysis
Video Signal Processing
· Multibase Transform Coding
· Disparity Field Estimation for Video Sequences
· Motion Vector Improvement in Block Matching Algorithms
Nonlinear Signal Processing
· Filter Structures
· Adaptive Algorithms
· Optimization Methods
· Implementations
· Document Imaging
Transforms and Spectral Techniques
· Design
· Spectral Methods in Nonlinear Filtering
· Spectral Methods in Classification
Image Processing
· Lossless Image Compression Using Predictive Boolean and Stack Filters
· Image Segmentation
· Image Component Labeling
· Color Image Processing with B-Spline Modeling
· Nonlinear Color Image Restoration
Interactive Multimedia Technology
· Broadband Multimedia Systems
· Multimedia User Interfaces
· Digital Video Quality
· VRML (Virtual Reality Modelling Language)
· Tools for Information Retrieval and Organization
· Integrated Publishing in Multimedia Networks
VLSI Circuit Design for Signal Processing
· DSP Processor Design
· Logarithmic Arithmetic Design
· DSP Core Design
Parallel and Multimedia Systems
· Parallel Neurocomputer Systems
· Multimedia Enabling Technology
Embedded Systems Design Methodology
· System Modelling
· Formal Specification
· HW/SW co-design
· Digital Signal Processor Implementations
Learning Systems
· Neural Networks
· Speech Coding
· Tracking algorithms
Medical Information Processing Systems
· Computerized Vigilance Analysis
· EEG Analysis with Neural Networks
· Medical Device Communication Standardization
· Processing of Medical Isotopic Images
· 3-D Visualization of Anatomic Data
Design and Implementation of Digital Filters and Filter Banks for Various Applications
· Synthesis of IIR Filters
· Synthesis of Computationally Efficient FIR Filters
· Optimization of Digital Structures for VLSI Implementation
· Design of Decimators and Interpolators
· Synthesis of Wavelet Filter Banks
· Filters for Telecommunication Applications
· Other Applications
ABOUT SPEECH AND AUDIO
http://www.cs.tut.fi/sgn/arg

Research related to speech and audio in the Signal Processing Laboratory is conducted by the Audio Research Group. This group was refounded in 1996 and is composed of 20 people working in speech and audio projects. Research topics include both basic research and more application-oriented research which is financed by industrial companies, the European Union, the Ministry of Finland and the Finnish Academy. Below is a short description of the main projects of 1997:

· Speech
Collection of Finnish speech databases for speech recognition development (SpeechDat and SpeechDat-Car)
Finnish speech recognition
Wideband speech coding

· Music
Music content analysis
Automatic transcription of music

· Spatial Analysis
Angle of arrival estimation for pulsed sound waves
Blind deconvolution for speech/audio signals
· Audio Coding
Investigation of possibilities in using various transforms in audio coding
Research on advanced quantization, coding, and bit allocation methods
Studies of advanced utilization of psychoacoustical phenomena
· Non-Linearity in Audio
Construction of a subwoofer controlled with a digital signal processor
Research of a digital power amplifier
ABSTRACT

TAMPERE UNIVERSITY OF TECHNOLOGY
Department of Information Technology
Signal Processing Laboratory
PHILIBERT, ARNAUD B.S.: Speech / Music Discriminator
Engineering Project, 32 pages
Examiners: Prof. Ioan TABUS, Bernard GILLY
March - August 1999
Keywords: speech, music, signal classification, speech / music discrimination, audio content analysis, speech processing

The study concerns the automatic classification of audio signals into two classes: speech and music. According to signal processing practice, this requires extraction, evaluation and selection of features, and then selection of a suitable classification method.
These two subproblems were studied and are reviewed. Emphasis of this study is laid on two points: on literature review and on algorithm development.
A literature review was conducted on music and audio classification and several related areas of interest. An appropriate decomposition of the problem and the selection of an approach is first considered. Then the state of the art of the research is presented and discussed, and promising directions for further work are indicated.
The main and maybe the most interesting part of this study concerns the development of algorithms, first to calculate the features used and then to conceptualize a classifier. The performance of the method is evaluated by applying it on a database created in the course of this project, based on various kinds of music and different languages. All the algorithms were implemented and simulated in the Matlab environment.
INTRODUCTION

Due to the increase in volume of the telecommunication field (the European proportion of people owning a mobile phone averages 30%), a lot of small firms appeared to supply the huge multinational enterprises. They evolve as satellites around the leaders of telecommunication systems, concentrating their effort on bringing solutions to them. University laboratories did not forget to take advantage of the situation, and the Department of Information Sciences underwent a restructuring to allow a more competitive profile, facing the Finnish giant builder of wireless telecommunication systems: NOKIA. The Signal Processing Laboratory was created in this way so that it could generate some wide-open research studies. The advantage of such a laboratory compared to companies' research centers is that it is usually almost "free" concerning the choice of its research topics. Thus all the aspects of a problem are studied. And this is the way my project was conducted: we were not to build one discriminator, the best one, the fastest one or the cheapest one, but to present a panel of possible classifiers.
Not being the only group interested in such a topic, we first drew a roadmap of the previous studies, collecting all kinds of information: features, parameters, classifiers... It allowed us to verify some intuitions, correct some others, and list the features upon which we would work. We then began to write code extracting the features from an audio signal, in order to verify their relevance. This was done through a first set of samples. The second step was the classification itself: we needed some material to improve the code and we spent some time collecting audio signals of all sorts. Then started the several improvements of the classifiers. We selected four of them and conducted a quick study on each. The following pages respect this chronology.
Even if my work is quite similar to a Master of Science thesis, I did not forget my engineering background when I wrote this report. That means I did not enter too far into the details. I specify the main decisions, explain the choice of the parameters, present the main results, and I just indicate where more information can be found. The review notes provided here are intended to include just enough material to render the work comprehensible.
LITERATURE REVIEW

A literature review was conducted on speech/music discrimination and related areas of interest. These are speech features, music features, feature extraction, and temporal segmentation. The meanings of these terms were given in the introduction.
In this chapter, we first describe the methods that were used in the search of literature. Then we briefly describe the state of the art and history of designing speech / music discriminators or classifiers. Finally, we summarize and discuss the most important references we found in the abovementioned research areas.

Methods
The research in the Signal Processing Laboratory and especially the Audio Research Group includes audio content analysis for several different purposes. Thus the main part of the literature was already selected. In addition to that, a major facilitating observation was that most researchers give their publication indices on-line on the World Wide Web. Finally, some laboratories working in the same areas of interest usually send their papers to other researchers, so the information spreads quickly and can be stored.
After becoming familiar with the relevant topics, we started searching for publications in the abovementioned ways. We collected many references, then we looked for the papers in the different databases such as IEEE, seminars such as ICASSP, and the web sites of the main institutes and universities. We also took a look at some patents written on the subject.
Our scope was efficient enough but limited by several factors. There are sometimes too many resources on a single point of interest and it can be hard to select what would be useful. The other problem is timing: to surf the web and through the IEEE papers takes time, and the most relevant papers were those which some other researchers recommended to us.
Related work
The Signal Processing Laboratory is obviously not the only institute working on a speech / music discriminator. Many researchers, and not the least, have led studies on audio classification and feature extraction. Eric Scheirer's discriminator1 and the one of Saunders2 are among the most reliable classifiers. The first one, which is the most related to our work, reports on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. This will be discussed later, but we can already mention that he managed to implement a classifier with only 5.8% error on a frame-by-frame basis and 1.4% error when integrating long segments of sound3.
Some other research centers we cannot list here are currently working on the subject, and it was useful to read their results.

Roadmap to the most important references
Below we list the most important papers, from our point of view, written on the subject. They allowed us to determine which features could be used, and catalogued the difficulties already met. For each article, we summarized the contents and listed the main features.
1. "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney, in the Proceedings of the 1997 International Conference on Acoustics, Speech and Signal Processing, Munich, April 21-24, 1997.
2. "Real-Time Discrimination of Broadcast Speech / Music" by John Saunders, A Lockheed Martin Co., Nashua, NH 03061.
3. The frames were 2.4 seconds long.
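The footnoted error rates illustrate how integrating per-frame decisions over long segments reduces the error. A minimal sketch of such an integration step, assuming per-frame class labels are already available (the function name and the simple majority-vote rule are illustrative, not taken from Scheirer's paper):

```python
import numpy as np

def integrate_frames(frame_labels, segment_len):
    """Majority-vote the per-frame labels inside each segment.

    frame_labels : 1-D array of 0 (speech) / 1 (music) frame decisions
    segment_len  : number of frames pooled into one segment decision
    """
    n_segments = len(frame_labels) // segment_len
    decisions = []
    for i in range(n_segments):
        chunk = frame_labels[i * segment_len:(i + 1) * segment_len]
        # the segment is labelled music if more than half of its frames are
        decisions.append(int(np.mean(chunk) > 0.5))
    return decisions

# toy example: 10 noisy frame decisions pooled into 2 segments
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0])
print(integrate_frames(labels, 5))  # -> [1, 0]
```

Pooling averages out isolated frame errors, which is why the segment-level error can be several times lower than the frame-level error.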
[Figure: sources of literature - IEEE, ICASSP, World Wide Web, other seminars, patents]
"A Comparison of Features For Speech, Music Discrimination" by M.J. Carey, E.S. Parris and H. Lloyd-Thomas.
The authors begin with a summary of the main approaches previously taken to the problem of discriminating between speech and music signals. They list all the features investigated by Saunders, Zue and Spina, Scheirer and Slaney. They insist on the fact that none of the works so far has used pitch and amplitude features explicitly, and thus they concentrate on them. They carried out experiments on amplitude, cepstra, pitch and zero-crossings. In each case, the derivative of the feature was also used. Through the database, the distributions in the feature space were modelled by a Gaussian Mixture Model. The database was composed of more than a thousand 10 s samples, band-limited to 4 kHz and sampled at an 8 kHz rate, and the frames are 10 ms long. The database assembled for these tests includes speech from thirteen languages and music from all over the world.
The best performance resulted from using the cepstra and delta cepstra, which gave an Equal Error Rate (EER) of 1.2%. This was closely followed by normalized amplitude and delta amplitude. Pitch and delta pitch gave an EER of 4%, which was better than the zero-crossing, which produced an EER of 6%.

"Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney.
This is one of the most robust discriminators at the moment. It examined 13 features: 5 are variance features (variances of Rolloff Point, Spectral Centroid, Spectral Flux, Zero-Crossing Rate, and Cepstral Residual Magnitude). Scheirer explains that if a feature has the property that it gives very different values for voiced and unvoiced speech, but remains relatively constant within a window of musical sound, then the variance of that feature will be a better discriminator than the feature itself. Other features include Pitch estimation and Low-Energy. Scheirer decided to use a Multidimensional Gaussian Maximum estimator, one based partially on Decision Trees, and another based on k-Nearest-Neighbors. The database is composed of 80 15-second-long audio samples recorded at a 22.05 kHz sampling rate from an FM tuner.
He obtains some all-features classifications with an error rate around 6.8 +/- 3.5%. It may be able to discriminate voiced from unvoiced speech.
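Scheirer's variance argument can be illustrated with a small sketch: compute a frame-wise feature (here the zero-crossing rate) and take its variance over windows of frames. A feature that jumps between voiced-like and unvoiced-like frames shows a much larger variance than the same feature computed on a steady musical tone. All names, frame sizes, and toy signals below are illustrative choices, not values from the paper:

```python
import numpy as np

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame; frames has shape (n_frames, frame_len)."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def feature_variance(feature, window):
    """Variance of a per-frame feature over consecutive windows of frames."""
    n = len(feature) // window
    return np.array([np.var(feature[i * window:(i + 1) * window]) for i in range(n)])

# toy signals: frames alternating noise/tone (speech-like) vs. a steady tone
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
speechlike = np.stack([rng.standard_normal(256) if i % 2 else np.sin(2 * np.pi * 200 * t)
                       for i in range(20)])
musiclike = np.stack([np.sin(2 * np.pi * 440 * t) for _ in range(20)])

var_speech = feature_variance(zero_crossing_rate(speechlike), 10)
var_music = feature_variance(zero_crossing_rate(musiclike), 10)
print(var_speech.mean() > var_music.mean())  # speech-like ZCR varies far more
```

The raw ZCR values of the two classes can overlap, but the variance of the ZCR separates them cleanly, which is exactly the property Scheirer exploits.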
In another article1, Eric Scheirer presents a method for using bandpass filters and banks of parallel comb filters to analyse the tempo and extract the beat from music signals. Scheirer is only concerned with beat and tempo and avoids what is relative to rhythm. He manages to demonstrate that the performance of his algorithm is similar to the performance of human listeners. This algorithm is based on the one implemented by Large and Kolen, in that it uses a network of resonators to phase-lock with the beat of the signal and determine the frequency of the pulse.

"Real-Time Discrimination of Broadcast Speech/Music" by John Saunders (1996).
This article contains details of techniques that Scheirer borrowed but did not explain. The purpose was to discriminate speech from music on broadcast FM radio. Thus, it requires minimal computation. Saunders explained a few features which can be used, such as tonality, bandwidth, excitation patterns, tonal duration and energy sequences. But the main features investigated by Saunders are energy and zero-crossing rate. A sample rate of 16 kHz was chosen for this discrimination technique. The database was composed of blocks of 2.4 seconds containing 150 frames, from an FM tuner. A Multivariate-Gaussian classifier was used.
Experimental results show performance approaching 98% correct classification.

"A Fast Audio Classification from MPEG Coded Data" by KDD R&D Labs and University of Electro-Communications.
Two points differentiate this algorithm from the others. First, it is based on silent segment detection. Remaining segments are then classified into three types: music, speech and applause. Secondly, it uses a Bayes discriminant function for a multivariate Gaussian distribution instead of manually adjusting a threshold for each discriminator. This gives a 90% rate of good classification.
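The Bayes discriminant function mentioned above can be sketched for the two-class case: fit one multivariate Gaussian per class on training feature vectors and pick the class with the larger log-likelihood, which replaces any hand-tuned threshold. This is a generic illustration under equal class priors, not the authors' exact formulation; the class and variable names are invented:

```python
import numpy as np

class GaussianDiscriminant:
    """Two-class Bayes classifier with one multivariate Gaussian per class."""

    def fit(self, X0, X1):
        # per-class mean vector, inverse covariance, and log-determinant
        self.params = []
        for X in (X0, X1):
            mu = X.mean(axis=0)
            cov = np.cov(X, rowvar=False)
            self.params.append((mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))
        return self

    def predict(self, x):
        # choose the class maximising the Gaussian log-likelihood
        scores = []
        for mu, icov, logdet in self.params:
            d = x - mu
            scores.append(-0.5 * (d @ icov @ d + logdet))
        return int(np.argmax(scores))

# toy 2-D feature vectors: class 0 around (0, 0), class 1 around (3, 3)
rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 2))
X1 = rng.standard_normal((200, 2)) + 3.0
clf = GaussianDiscriminant().fit(X0, X1)
print(clf.predict(np.array([0.1, -0.2])), clf.predict(np.array([2.9, 3.1])))
```

With unequal priors one would add a log-prior term to each score; the decision rule otherwise stays the same.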
"Hierarchical Classification of Audio Data for Archiving and Retrieving" by Tong Zhang and Jay Kuo.
The authors implemented a system able to classify and retrieve audio signals. First, speech, music, environmental sounds and silence are discriminated. Second, environmental signals are
1. "Tempo and Beat Analysis of Acoustic Musical Signals" by Eric D. Scheirer (1996)
further classified into subclasses using time-frequency analysis and Hidden Markov Model likelihood (a model was built for each class). In the third phase, the system processes a query-by-example audio retrieval by computing the various probabilities. There are two main classes of features: physical features such as Energy, Zero-Crossing Rate and Fundamental Frequency, and perceptual features (Timbre and Rhythm).
Preliminary experiments showed that an accuracy rate of over 80% can be achieved with the proposed fine classification method.

"Recognition of Music Types" by H. Soltau, T. Schultz, M. Westphal, A. Waibel from Interactive Systems Laboratories.
This paper focuses on music classification. No speech signals are considered, but it is useful to see how we can discriminate music types through the Hidden Markov Models method (HMM) and the Explicit Time Modelling with Neural Network method (ETM-NN). The purpose is to index and search music tapes in multimedia databases.
First, signals need to be preprocessed in order to extract the cepstral feature, then launch an HMM analysis.
The following table lists all the features used by the several discriminators and records how often each was used and by which researcher (referenced in the following table).

TABLE 1. References of the articles

reference article
1 "A Comparison of Features For Speech, Music Discrimination" by M.J. Carey, E.S. Parris and H. Lloyd-Thomas.
2 "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney.
3 "Tempo and Beat Analysis of Acoustic Musical Signals" by Eric D. Scheirer (1996).
4 "Real-Time Discrimination of Broadcast Speech/Music" by John Saunders (1996).
5 "A Fast Audio Classification from MPEG Coded Data" by KDD R&D Labs and University of Electro-Communications.
6 "Hierarchical Classification of Audio Data for Archiving and Retrieving" by Tong Zhang and Jay Kuo.
7 "Recognition of Music Types" by H. Soltau, T. Schultz, M. Westphal, A. Waibel from Interactive Systems Laboratories.
TABLE 2. Features used in the Discriminators
1 2 3 4 5 6 7
Amplitude ● ●
Delta Amplitude ●
Pitch ● ●
Delta Pitch ●
Four-Hertz Modulation Energy ●
Roll-Off of the Spectrum ●
Variance of the Roll-Off feature ●
Spectral Centroid ● ● ●
Variance of the Spectral Centroid ● ●
Spectral Flux ●
Variance of the Spectral Flux ●
Cepstral Residual ● ●
Variance of the Cepstral Residual ● ●
Cepstral feature ●
Pulse Metric ●
Zero-Crossing Rate ● ● ● ●
Standard Deviation of first order difference of ZCR
● ● ●
Third Central Moment about the Mean of ZCR
●
Total number of zero-crossings exceeding a threshold
●
Difference between zero-crossing samples above and below the mean
●
Number of Energy minima below a relative threshold
●
Number of silent segments ●
Bandwidth of signal ●
hidden unit (HMM) ●
The features can be ordered in the following way1:
Signal "brightness" features
- Centroid and its variance
- Zero-Crossing Rate and its variance
- Roll-Off point
- Fundamental Frequency
Spectral Shape features
- Cepstrum
- Cepstral Delta
TABLE 3. Cross-references between researchers
Vertical refers to horizontal
Scheirer & Slaney / Saunders / Carey, Parris & Lloyd-Thomas / KDD R&D Labs / Soltau & Schultz
Scheirer & Slaney ● ● ●
Saunders ● ● ● ●
Zue & Spina ●
Povel & Essens ●
Desain & Honig ●
Large & Kolen ●
Longuet-Higgens & Lee ●
Rosenthal ●
Brown ●
Todd ●
Kedem ● ● ●
Bogert ●
1. This is just a preliminary characterization of the features. The full definitions will be given later.
Harmonic/Noisy features
- Cepstral Residual
- Pitch (relative to the tone "color" of the signal)

Amplitude Envelope features
- 4 Hz modulation
- Pulse metric
- Rhythm
- Energy
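Two of the features catalogued above, the spectral centroid (a "brightness" measure) and the roll-off point, can be sketched directly from their definitions. The frame length, sampling rate, and the 95% roll-off fraction below are illustrative choices, not values prescribed by the surveyed papers:

```python
import numpy as np

def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency of one frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * mag) / np.sum(mag)

def rolloff_point(frame, fs, fraction=0.95):
    """Frequency below which `fraction` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

# a pure 1 kHz tone: both measures sit at (or very near) 1 kHz
fs = 8000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 1000 * t)
print(spectral_centroid(frame, fs), rolloff_point(frame, fs))
```

Both features rise when spectral energy shifts toward high frequencies, which is why they are grouped as "brightness" measures.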
Commercial products
Nowadays, none of the discriminators has yet been applied in a commercial system. But we already know that one of the applications could be on broadcast FM radio. The computational simplicity of the approach could lend itself to wide application, including the ability to automatically change channels when commercials appear. And anyway, the feature extraction may be applied to speech and music recognition and thus be widely used.
DECOMPOSITION OF THE CLASSIFICATION PROBLEM

Before using any classifier algorithm, we had to think about the criteria of discrimination. The previous chapters and the several articles on the subject indicated there are many features we can ever use, but depending on the computing environment and the aims of the application systems, we had to select several of the features. The previous work on the subject helped to catalogue the criteria, but we first had to evaluate them ourselves and make a decision concerning their efficiency. So we coded algorithms to evaluate each feature, and drew diagrams and calculated figures to help in our decision.
Because it is "free" research, we had no specific constraint but the Matlab computing environment. So we decided to collect as many features as we could. Thus the classifier algorithm, whatever it would be, would perform as well as the number of features allows. The only constraints were a reasonable speed of computation (the main commercial applications seem to be real-time systems) and a low memory requirement during the computation (the Matlab simulator is not really efficient in this respect).
This had to be completed with the building of an efficient database. The more material we have, the better it is for the classification improvement. We needed this first to improve our feature extractors, and then to develop the various classifiers: the way to do so is to build the classifiers by determining the parameters with training data, and verifying the efficiency with other data. Other teams working on audio classification average several hundreds of signals, and some reach several thousands. An important database of music signals was already stored in the laboratory's archives, but it was lacking enough speech samples. We thus completed the database with speech records, collected through the laboratory staff. The advantage of this staff is that it comprises many different nationalities, so that we recorded various languages and accents.
Eventually comes the classification by itself. We had a panel of available classifiers. There was also the constraint of the Matlab computing environment, which forced us to use some classifiers which did not need so much memory. At least, it was a "good" constraint with regard to the future commercial applications, which require fast computation. So our choice was almost unlimited and we tried to improve the most current or famous ones.
[Figure: project workflow - related works, feature improving, material improving (SpeechDat recording, ARG files), classification]
CHARACTERISATION OF AUDIO SIGNAL: FEATURE EXTRACTION AND TEMPORAL SEGMENTATION

As we noticed in the previous review of the related studies, several approaches have been studied, and each of these used different features and pattern classification techniques, and described results on different material. In this report we study the relevancy of several of the different features previously suggested, and especially of some we believe have useful properties. We carried out tests on the same data. In this way we attempted to provide a comparative view of the value of the different types of features in speech/music discrimination.

Mathematical definition of the main features and improving

We first started by using 15 music samples and 15 speech samples. We calculated the feature values for each signal, plotted the results (standard deviation and mean), and determined the mean and the mean of the variance of each feature, first for all music samples, then for speeches. In the same way, we also give the variance of the mean of the features, first for all music samples, then for speeches. Later we refined with a larger database.

Cepstrum. As a preliminary view, let us say the cepstrum measures the spectrum shape or the sound "color". The term cepstrum was introduced by Bogert et al.1 and has come to be accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal. They observed that the logarithm of the power spectrum of a signal containing an echo has an additive periodic component due to the echo, and thus the Fourier transform of the logarithm of the power spectrum should exhibit a peak at the echo delay. They called this function the cepstrum, interchanging letters in the word spectrum because "in general, we find ourselves operating on the frequency side in ways customary on the time side and vice versa." Bogert et al. went on to define an extensive vocabulary to describe this new signal processing technique; however, only the term cepstrum has been widely used.

1. B. P. Bogert, M. J. R. Healy, and J. W. Tukey. The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-autocovariance, Cross-Cepstrum, and Saphe Cracking, pages 209-243. John Wiley and Sons, New York, 1963.
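Bogert et al.'s observation can be illustrated numerically: the cepstrum of a signal containing an echo shows a peak at the echo delay. A minimal numpy sketch (ours, not the thesis' Matlab code; the function name and the 0.8 echo gain are illustrative):

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log power spectrum, per Bogert et al.'s definition."""
    spectrum = np.fft.fft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # small floor avoids log(0)
    return np.real(np.fft.ifft(log_power))

# A noise signal with an added echo: the cepstrum peaks at the echo delay.
n = 512
rng = np.random.default_rng(0)
s = rng.standard_normal(n)
delay = 64
echoed = s + 0.8 * np.roll(s, delay)  # circular echo, delay in samples

c = real_cepstrum(echoed)
# Ignore the low-quefrency region, which is dominated by the spectral envelope.
peak = int(np.argmax(c[16 : n // 2])) + 16
```

With these parameters `peak` recovers the 64-sample echo delay, which is exactly the property the feature exploits.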
The cepstrum residual is the margin between the smoothing of the spectrum and the original spectrum. It is calculated as follows: the value of this feature is determined by first computing the cepstrum of the spectrogram by means of a Discrete Fourier Transform. The result is then smoothed over a time window, and the sound is resynthesized. The smooth spectrum is then compared to the original (unsmoothed) spectrum to obtain an error value. Usually the error value is higher for voiced speech and music. The cepstrum resynthesis residual magnitude is computed as follows:

$$CR \equiv \sum_k \left( X[k] - Y[k] \right)^2$$

Spectral Centroid. It represents the balancing point of the spectral power distribution within a frame. Many types of music involve percussive sounds which, by including high-frequency noise, result in a higher spectral mean. In addition, the excitation energies can be higher for music than for speech, in which pitch stays within a range of fairly low values. As a result, the spectral centroid for music is, on average, higher than for speech. In addition, the spectral centroid has higher values for unvoiced speech than it does for voiced speech. The spectral centroid for a frame occurring at time t is computed as follows:

$$SC \equiv \frac{\sum_k k\,X[k]}{\sum_k X[k]}$$
where k is an index corresponding to a frequency, or small band of frequencies, within the overall measured spectrum, and X[k] is the power of the signal at the corresponding frequency band.

[Figure: two different examples of the Spectral Centroid, each represented by a ∆.]

Spectral Flux. This feature measures the frame-to-frame spectral difference; thus it characterizes the change in the shape of the spectrum. Speech has a higher rate of change, and goes through more drastic frame-to-frame changes than music. We can note that speech alternates periods of transition (consonant-vowel boundaries) and periods of relative stasis (vowels), where music typically has a more constant rate of change. As a result the spectral flux value is higher for speech, particularly unvoiced speech, than it is for music.

The Spectral Flux (also known as the Delta Spectrum Magnitude) is the 2-norm of the frame-to-frame spectral amplitude difference vector:

$$SF \equiv \left\lVert X_t[k] - X_{t+1}[k] \right\rVert_2$$

Zero-Crossing Rate. This value is the number of time-domain zero-voltage crossings within a speech frame. In essence, the zero-crossing rate indicates the dominant frequency during the time period of the frame. This is a correlate of the spectral centroid.
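These three per-frame measures can be sketched directly from their definitions. A minimal numpy illustration (our function names, not the thesis code), compared on a low-pitched and a high-pitched tone over one 100 ms frame:

```python
import numpy as np

def spectral_centroid(frame):
    """SC = sum(k * X[k]) / sum(X[k]) over the power-spectrum bins."""
    X = np.abs(np.fft.rfft(frame)) ** 2
    k = np.arange(len(X))
    return np.sum(k * X) / (np.sum(X) + 1e-12)

def spectral_flux(frame_a, frame_b):
    """2-norm of the frame-to-frame spectral magnitude difference vector."""
    A = np.abs(np.fft.rfft(frame_a))
    B = np.abs(np.fft.rfft(frame_b))
    return np.linalg.norm(A - B)

def zero_crossing_rate(frame):
    """Number of sign changes in the time-domain frame."""
    signs = np.sign(frame)
    return int(np.sum(signs[:-1] != signs[1:]))

fs = 8000
t = np.arange(800) / fs                 # one 100 ms frame, as in the thesis
low = np.sin(2 * np.pi * 200 * t)       # low-frequency tone
high = np.sin(2 * np.pi * 2000 * t)     # high-frequency tone
```

As expected, the high tone yields both a higher centroid and a higher zero-crossing rate, which is the correlation noted above.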
Rolloff Point. This value measures the frequency below which 95% of the power in the spectrum resides. Music, due to percussive sounds, attack transients, and the like, has more energy in the high frequency ranges than speech. As a result, the spectral roll-off point exhibits higher values for music and unvoiced speech, and lower values for voiced speech. This is a measure of the "skewness" of the spectral shape: the value is higher for right-skewed distributions. The spectral roll-off value for a frame is computed as follows:

$$SR = K, \quad \text{where} \quad \sum_{k < K} X[k] = 0.95 \sum_{k} X[k]$$

[Figure: two kinds of Rolloff Point of the spectrum (spectrum amplitude against frequency).]

Pulse metric. This is a feature which uses long-time band-passed autocorrelations to determine the amount of "rhythmicness" in usually a 5-second window. It does a good job of telling whether there is a strong, driving beat (i.e. techno, salsa, straight-ahead rock-and-roll) in the signal. It cannot detect rhythmic pulse in signals with rubato or other tempo changes.

Percentage of "Low-Energy" Frames. This value measures the proportion of frames with RMS power less than 50% of the mean RMS power within a one-second window. This is an amplitude feature. The energy distribution for speech is more left-skewed than for music (there are more quiet frames), so this measure will be higher for speech than for music.
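Both amplitude-distribution features above can be sketched in a few lines of numpy (illustrative code of ours; the "speech-like" and "music-like" test frames are synthetic stand-ins, not thesis data):

```python
import numpy as np

def rolloff_point(frame, fraction=0.95):
    """Smallest bin K such that the cumulative power below K reaches 95% of the total."""
    X = np.abs(np.fft.rfft(frame)) ** 2
    cumulative = np.cumsum(X)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]))

def low_energy_rate(frames):
    """Fraction of frames whose RMS power is below 50% of the mean RMS power."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms < 0.5 * np.mean(rms)))

fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 200 * t)                         # narrowband, low rolloff
noisy = tone + 0.5 * np.random.default_rng(1).standard_normal(len(t))  # broadband

# Speech-like amplitude pattern: loud bursts separated by near-silence;
# music-like: a steadier envelope.
speechlike = np.stack([tone, 0.01 * tone, tone, 0.01 * tone, tone])
musiclike = np.stack([tone, 0.9 * tone, tone, 0.9 * tone, tone])
```

The broadband frame pushes the rolloff bin far up the spectrum, and the bursty envelope yields the higher low-energy rate, matching the qualitative claims above.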
4 Hz modulation energy. Speech has a characteristic energy modulation peak around the 4 Hz syllabic rate. We use a filterbank (with logarithmic scale) to convert the audio signal into 8 perceptual channels. We extract the energy in each band, bandpass filter each channel with a second-order filter with a center frequency of 4 Hz, then calculate the short-term energy by squaring and smoothing the result. We normalize each channel's 4 Hz energy by the overall channel energy in the frame, and sum the result from all channels. Speech tends to have more modulation energy at 4 Hz than music does. Practically, we can usually notice a peak around the 4 Hz frequency on the spectrum if the signal is speech.

[Figure: left, the power spectrum of a music signal; right, that of a speech signal, with a peak at the 4 Hz frequency.]

This is a relevant feature but uneasy to extract computationally: the peak can be present but lost in the mass of frequencies in the spectrum. The best way would be to analyse the whole training set, watch the amplitude of the 4 Hz frequency, determine the mean for each class, and discriminate the two groups by thresholding or with a Gaussian classifier. We did not first complete the codes this way, in order to keep the computation fast. We limited ourselves to analyzing whether the 4 Hz peak was stronger than both the 3 Hz and 5 Hz ones.
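The simplified 3/4/5 Hz comparison described above can be sketched as follows. This is our illustration, not the thesis code: it uses a crude rectified envelope on the full band instead of the 8-channel filterbank, and a synthetic 4 Hz amplitude-modulated carrier as a speech stand-in:

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs  # 2 seconds of signal

# Speech-like: a 500 Hz carrier amplitude-modulated at the ~4 Hz syllabic rate.
speechlike = (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
# Music-like stand-in: the same carrier with a steady envelope.
musiclike = np.sin(2 * np.pi * 500 * t)

def modulation_bin_energy(signal, hz):
    """Magnitude of the amplitude envelope's spectrum at a given modulation frequency."""
    envelope = np.abs(signal)  # crude envelope (no perceptual filterbank here)
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    bin_index = int(round(hz * len(signal) / fs))
    return spectrum[bin_index]

def looks_like_speech(signal):
    """The thesis' simplified test: is 4 Hz stronger than its 3 and 5 Hz neighbours?"""
    e3, e4, e5 = (modulation_bin_energy(signal, h) for h in (3, 4, 5))
    return e4 > e3 and e4 > e5
```

On these synthetic signals the modulated carrier passes the test while the steady one carries essentially no 4 Hz envelope energy.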
Selection of the features

We have already said that the more features we use, the better it is when improving a classifier algorithm, so we were tempted to select all of the previous features. One constraint, and not the least, is the Matlab environment: the processes need much memory when running, and we can sometimes run out of it. Matlab is also an interpreted, line-by-line language, which means some routines are not feasible as they would be in the C language. The features thus excluded are all pitch estimation features. Let us also note that some classifiers will be confused by bad features (a good aspect of the Decision Tree classifier is that it is not confused this way).
We then decided to select the most relevant features in our point of view, as listed below:

· Spectral Centroid
· Spectral Flux
· Rolloff Point
· Residual Cepstrum
· Zero-Crossing Rate
· 4 Hz Modulation

In addition, we used the variance of these six features as well. Computing them, it was apparent that some results could be interesting (standard deviations are sometimes quite different for speech and music), and it easily increases the number of features.

Improving of the features

We needed to verify and confirm the relevancy of each feature before pledging to integrate them in the classifying process. So we wrote some codes calculating, storing and eventually evaluating the values of the various features and their variance on a collection of speeches and music samples, and plotted diagrams to visually estimate whether our choice was efficient.

Later in this report the audio material is described, but we can already state that sounds were studied frame-by-frame, each frame being 100 ms long. We thus determined the mean and standard deviation for each feature and each signal. The length of the frame was chosen regarding the various previous studies, and by averaging. It seems to be a reasonable choice, even if real-time systems at a commercial production level need faster, thus shorter, frames. Calculations can easily be conducted later without many modifications inside the codes.

The respective codes can be found in the appendix. They respect an identical programming structure: reading of the parameters of the records (sampling frequency, length, ...), storage of the signals frame-by-frame, filtering if needed, calculation of the mean value and variance value of the feature, and plotting of a two-dimension variance-to-mean diagram distinguishing the speeches from the music samples. These diagrams are also available in the appendix, but below is a quick review.

[Diagrams: red crosses are speeches; blue circles are music samples.]
Discussion. All the features we decided to integrate are relevant. Looking at the diagrams, at least one of the two dimensions is distinct for music and speech.

The plotting allowed us to suppress some of the signals in our material which were evidently unadapted (their plot was located far out of the usual range).

We can already notice that the Spectral Flux seems to be a good discriminator, in that the plots of each class are not really mixed, and that, looking at the centers of mass of the diagrams, the Spectral Centroid and the Zero-Crossing Rate are effectively correlated.

The Audio Material

We tried to cover the largest panel of audio signals. Even if we limited ourselves to the classification of the sounds into only two categories, music and speech, the material has to be representative of the audio range, and it can later be used for experiments in finer classification (female, male, rock, pop, jazz, ...).

A good collection of music samples was already stored on the lab computers, representing all kinds of music (classical, pop, techno, new age, rock, jazz). The singers were also considered as music1. But the Audio Research Group (ARG) was lacking speech signals, or they were not appropriate to our purpose: they were mainly short vowels or alphabetical letters, or speech recordings made through a phone line. So it was decided to increase the ARG material by creating some new. Using specific texts produced by the NOKIA Research Center, we recorded people's voices with a portable DAT recorder. The advantage of our laboratory in Tampere is that we usually meet many nationalities, so we collected various tongues and accents, as follows:

1. This was a trouble in that a singer is a human being and thus produces sounds with his body, so he respects the patterns of speech. Here there can be a problem of class definition. Regarding the possible commercial products, in the radio broadcast field, we defined the songs as music signals and processed the samples in such a way. We can also notice that songs are rarely "a capella", and then there exists a musical background.
Table 4: Various tongues recorded with the DAT recorder

English    French    Scottish
Estonian   Irish     Russian
Finnish    Chinese   Polish
Spanish    Japanese  Italian
Armenian
In spite of the real and sad lack of female researchers in our laboratory, we managed to record a little collection of female voices.

These records were then digitized with a sound editor as *.aif files, and then converted into wave files to be processed in the Matlab environment. This material is now available in the ARG's files. The signals are at least 2 seconds long and average a 10-second length. After selecting them, we worked on 101 music samples and 135 speech samples.
CLASSIFICATION

We explained previously that we wanted to choose the most efficient and common classification functions. The fact is that many classifiers are nowadays available, and it may sometimes be uneasy to make the best choice. Currently, the discriminant functions used by the various researchers are the Gaussian function, Linear Discriminant Analysis, k-Nearest-Neighbors, and Neural Networks. We implemented two of them (Gaussian and k-Nearest-Neighbors), attempted to process a Neural Network algorithm but failed (the Matlab software presented too much of a lack of memory), implemented another one based on Learning Vector Quantization, and worked on a new one: the Tree-structured classifier.

I do not intend to describe precisely the workings of these classification functions (much has been written on them), but below is an abstract of those we used. I advise the reader to get the books quoted in references1&2 if he wishes to get more information on the subject.

First of all, how to implement the classifiers

We know that the more samples we have in the training set, the more valuable are the parameters of the discriminant function and the more efficient is the classifier. So the idea is to use all the samples we have, but one, as the training set. This last one is thus used as an "unknown" signal to be classified. Then we iterate this process through all the material. That means that, as we own 101 music samples and 134 speech samples, we evaluate the parameters 235 times and classify as many signals. Then we verify how many music samples and speech samples are well classified, and calculate the rate of success.

As we previously stated, the signals are studied frame-by-frame and each frame is 100 ms long.
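The leave-one-out procedure described above can be sketched generically. This is our illustration, with a trivial nearest-class-mean stand-in for the real classifiers and synthetic two-feature data:

```python
import numpy as np

def leave_one_out_accuracy(features, labels, train_and_classify):
    """Hold each sample out in turn, train on the rest, classify the held-out one."""
    correct = 0
    n = len(labels)
    for i in range(n):
        keep = np.arange(n) != i
        predicted = train_and_classify(features[keep], labels[keep], features[i])
        correct += int(predicted == labels[i])
    return correct / n

def nearest_mean(train_x, train_y, x):
    """Toy classifier: pick the class whose feature mean is closest."""
    classes = np.unique(train_y)
    means = [train_x[train_y == c].mean(axis=0) for c in classes]
    dists = [np.linalg.norm(x - m) for m in means]
    return classes[int(np.argmin(dists))]

rng = np.random.default_rng(2)
music = rng.normal(0.0, 0.5, size=(20, 2))   # hypothetical music feature cloud
speech = rng.normal(3.0, 0.5, size=(20, 2))  # hypothetical speech feature cloud
X = np.vstack([music, speech])
y = np.array([0] * 20 + [1] * 20)

accuracy = leave_one_out_accuracy(X, y, nearest_mean)
```

Any of the classifiers below can be plugged in as `train_and_classify`; the thesis does exactly this with 235 samples.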
Gaussian discrimination

We used the multinormal Gaussian classification, assuming that each class was equally likely. Here is the probability of a vector r:

1. "Machine Learning", Tom M. Mitchell, The McGraw-Hill Companies, Inc.
2. "Pattern Recognition and Neural Networks", B. D. Ripley, University of Oxford, Cambridge University Press.
$$P(r \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(r-\mu)^2 / (2\sigma^2)}$$

where µ is the mean and σ the standard deviation. These parameters are first estimated by improving the function through a training matrix1. Then, to classify an unknown signal, we calculate the two probabilities of being music or speech and adopt the greater one.

This algorithm has a lot of advantages. First, it is simple to integrate: the codes2 are very easy to write (it concerns just one equation to solve), and the computing cost is quite low (not many loops). Regarding the possible commercial applications, with the constraint of real-time systems, it may be one of the best processes.

The obtained results confirm our feeling. The calculation time depends of course on the specifications of the Unix stations, but it already indicates that this is the fastest process we studied.

We used the whole set of features first. Below are the various rates of success.

Table 5: Rates of Good Classification with the Gaussian method

global rate   92,34%
music rate    94,06%
speech rate   91,05%

We then thought about the possibility of optimizing the algorithm using only the most relevant features. Consequently we launched the process independently, studying the relevancy of each feature one after the other. The results are tabled below.

1. A Training Matrix is an M*N matrix, with N known samples as columns and M features as rows.
2. Code available in appendix.
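The likelihood comparison above can be sketched as follows; assuming, as a simplification, independent features per class (the per-class means and standard deviations below are hypothetical, not the thesis' estimates):

```python
import numpy as np

def gaussian_pdf(r, mu, sigma):
    """P(r | mu, sigma) = exp(-(r - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-((r - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def classify(r, params):
    """Pick the class with the greater likelihood (equal priors, as in the thesis)."""
    likelihoods = {name: np.prod(gaussian_pdf(r, mu, sigma))  # independent features
                   for name, (mu, sigma) in params.items()}
    return max(likelihoods, key=likelihoods.get)

# Hypothetical per-class (mean, std) vectors estimated from a training matrix.
params = {
    "music":  (np.array([2.0, 5.0]), np.array([0.5, 1.0])),
    "speech": (np.array([4.0, 1.0]), np.array([0.5, 1.0])),
}
```

A feature vector near the music means is assigned to music, and symmetrically for speech; this is the "one equation to solve" cheapness noted above.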
We can verify that the 4 Hz modulation feature is not very relevant, as we estimated previously during the mathematical description.

Table 6: Rates of Good Classification with the Gaussian method, feature by feature

Feature               Music (%)   Speech (%)   Global Rate (%)
Centroid              90,1        53,73        69,36
Rolloff               94,06       21,64        52,77
Flux                  87,13       58,96        71,06
Cepstrum              70,3        59,7         64,26
Modulation 4 Hz       7,92        87,31        53,19
Zero-Crossing Rate    93,07       29,1         56,6

Regarding these values, we attempted to classify the signals using just a combination of the Spectral Flux and Spectral Centroid, or a combination of these two features added to the Cepstrum. The result is that neither combination produces better rates of classification. The more features we integrate, even ones with a low relevancy, the better is the classifier.

k-Nearest-Neighbors classification

The algorithm assumes all instances correspond to points in an n-dimensional space. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance x be described by the feature vector

(a1(x), a2(x), ..., an(x))

where ar(x) denotes the value of the rth attribute of instance x. Then the distance between two instances xi and xj is defined to be d(xi, xj), where

$$d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$
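The distance and the majority vote together give the whole algorithm; a minimal sketch (our code, with a toy six-sample training set):

```python
import numpy as np

def knn_classify(train_x, train_y, x, k=3):
    """Majority class among the k nearest training instances (Euclidean distance)."""
    d = np.sqrt(np.sum((train_x - x) ** 2, axis=1))
    nearest = train_y[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Two small, well-separated feature clouds (illustrative values only).
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
train_y = np.array(["music", "music", "music", "speech", "speech", "speech"])
```

With k=3, as retained in the thesis, an unknown point is labelled by the cluster it falls into; raising k too far makes distant instances from both classes outvote the local neighbourhood, which is the "drowning" effect discussed below.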
So the class of an unknown sample is the most common class value among the k nearest instances according to the above expression.

We first launched the process with the value 3 for the parameter k. The results are below:

Table 7: Rates of Good Classification with the kNN classifier

global rate   94,47%
music rate    89,11%
speech rate   98,51%

These are quite good rates of success, so we can consider the kNN method very compatible with speech/music discrimination.

Table 8: Rates of Good Classification with the kNN classifier, feature by feature

Feature               Music (%)   Speech (%)   Global Rate (%)
Centroid              78,22       91,4         85,53
Rolloff               56,44       84,33        72,34
Flux                  75,25       82,84        79,57
Cepstrum              72,28       88,06        81,28
Modulation 4 Hz       97,03       11,19        48,09
Zero-Crossing Rate    75,25       85,82        81,28

We attempted to increase these values by modifying the parameter k. We tried several values, but the case k=3 seems to be a good compromise: when we raise the value of the parameter, the classifier looks too far away from the plot under consideration and integrates too many instances from both classes. The "unknown" instance is then drowned in the whole stack.

Neural Network

This kind of process was very famous in the 80's, when artificial intelligence and the patterns of the brain's way of working were studied. The details of this process are very complex, so let us just say that the parameters of the algorithm are determined following the Perceptron Training Rule.
The various ways to go from inputs (in our case, samples) to outputs (music and speech classes) are multiple and weighted. So the learning problem is to determine a weight vector that causes the perceptron to produce the correct output value for each of the given training examples. There are several algorithms to determine these weights, and we will not describe them here.

[Figure: global aspect of the Neural Networks structure.]

We attempted to use Matlab's predefined functions1 to build such a discriminator. We quickly failed, because such functions are supposed to work on very simple cases, and it seems there are a lot of iterations during the Perceptron Training process (after a 48-hour delay, the processes were still running). We abandoned the use of Matlab functions and attempted another method.

Learning Vector Quantization2 & 3. Vector Quantization is a classical method in signal processing to produce an approximation to the distribution of a single class by a codebook (that is also called the training set). Each incoming signal is mapped to the nearest codebook vector, and that vector is sent instead of the original signal.

The major advantages of formulating the VQ problem in terms of neural networks are:

1. Neural Networks functions, Demo Application, Matlab 5.3.
2. "Possibilistic Clustering in Kohonen Networks for Vector Quantization" by D. Ghosh and A. P. Shivaprasad, Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India.
3. World Wide Web URL: http://www.ee.eng.ohio-state.edu
- Since neural networks have a parallel computing architecture, they may be used for real-time VQ.
- The various algorithms available for training neural networks may be used for self-organization of the codebook in VQ.
- Since most of the neural network learning methods are adaptive, the VQ training algorithms derived from these allow for the design of codebooks in adaptive vector quantization (AVQ).

Thus we built a new neural network with such a criterion. We used a code already existing on the laboratory computers whose purpose is LVQ-based classification.

[Figure: global aspect of the Learning Vector Quantization. 1. The original patterns A, B, C are divided into vectors X. 2. These vectors X are compared to a table. 3. The index of the best matching pattern in the codebook is used. The codebook can be constructed in advance or dynamically.]

Below are the various results:

Table 9: Rates of Good Classification with the LVQ-based classifier

global rate   45,11%
music rate    35,64%
speech rate   52,24%

We were not able to increase those rates, and we preferred to spend more time on the following classifier.
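The nearest-codebook mapping at the heart of VQ (steps 2 and 3 in the figure) can be sketched in a few lines; the codebook and incoming vectors below are illustrative, not a trained LVQ codebook:

```python
import numpy as np

def quantize(signal_vectors, codebook):
    """Map each incoming vector to the index of its nearest codebook vector."""
    # Pairwise distances, shape (n_vectors, n_codes), via broadcasting.
    d = np.linalg.norm(signal_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
vectors = np.array([[0.2, -0.1], [0.9, 1.2], [4.8, 5.1], [1.1, 0.8]])
indices = quantize(vectors, codebook)
```

LVQ training then moves the codebook vectors toward (or away from) training samples depending on whether the nearest code has the correct class label; only this quantization step is shown here.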
Decision Tree algorithm

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some feature value of the instance, and each branch descending from that node corresponds to one of the possible values of this feature. An instance is classified by starting at the root node of the tree, testing the feature specified by this node, then moving down the tree branch corresponding to the value of the feature in the given example. This process is then repeated for the subtree rooted at the new node.

[Figure: usual aspect of a Decision Tree. The root node considers all features on the training set S; each descendant node tests one feature on a training subset, down to leaves labelled Class=1 or Class=2. Each subset is composed of the samples which have the same value for the maximum-gain feature.]

Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. In other words, it searches at each level or node for the most convenient feature to be the best classifier. Then this feature is used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this feature, and the training examples are sorted to the appropriate descendant node. The entire process is then repeated using the training examples
associated with each descendant node to select the best feature to test at that point in the tree. We thus need a criterion to select the appropriate feature: the Entropy. This statistical property measures the information gain, or how well a given feature separates the training examples according to their target classification. Here the Entropy of a collection S is:

$$\mathrm{Entropy}(S) \equiv -\,p_{music}\log p_{music} - p_{speech}\log p_{speech}$$

where p_music is the proportion of music samples and p_speech is the proportion of speech samples in S. In all calculations involving Entropy we define 0·log 0 to be 0. We give Entropy as a measure of the impurity of a set of training samples relative to a feature. We thus calculate the Information Gain of a feature A relative to the whole set S of training signals in this way:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \,\in\, Values(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S_v is the subset of S for which feature A has value v, and Values(A) is the set of all possible values for feature A. We determine this Information Gain for each feature expected in the subtree at the considered level and select the one with the higher value.

To sum up the process: we build a Decision Tree and then classify an "unknown" signal by conducting it through the Tree, looking at its features' values. To build the Tree, we have to determine the Information Gain for each feature at each level, as explained previously. So the values of the features for the whole stack of samples have to be discretized, so that a signal belongs to one of the subclasses of the investigated features. First we attempted to distribute the various values of each feature between two subclasses, high values and low values, considering that all that is above the mean of the values of the feature is a high value and all that is below is a low value. A new training matrix was thus completed, composed only of 1s and 0s (1 for a high value and 0 for a low value).
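The two formulas above can be sketched directly (our illustration, using base-2 logarithms and a toy eight-sample set with the 1/0 above-the-mean discretization described in the text):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -p_music log p_music - p_speech log p_speech (0 log 0 := 0)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))  # np.unique drops empty classes, so no 0 log 0

def information_gain(labels, feature_values):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = np.array(["music"] * 4 + ["speech"] * 4)
# Discretized feature values (1 = above the mean, 0 = below), as in the thesis.
perfect = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # splits the classes exactly
useless = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # carries no class information
```

A feature that splits the classes exactly achieves the full one bit of gain, while an uninformative feature scores zero; this is how low-relevancy features end up "forgotten" by the tree.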
The results are not as relevant as we wished, without being unacceptable:

Table 10: Good Classification with the Decision Tree method

Global Rate of Good Classification                85,11%
Rate of Good Classification of Music Signals      85,82%
Rate of Good Classification of Speech Signals     84,16%

We thought that our way of discretizing the features' values was too coarse. We then opted to still discretize the features' values into two subclasses, but with the help of the Gaussian classifier as already used previously. Following are the results:

Table 11: Good Classification with Discretization of the Features by the Gaussian method

Global Rate of Good Classification                71,91%
Rate of Good Classification of Music Signals      68,32%
Rate of Good Classification of Speech Signals     74,63%

It does not seem to perform better either, but we detected another shortcoming: we study only one feature, one by one, at each level of the Tree. It may be more relevant to examine a combination of several features to better specify the discrimination at each level. So we worked on building an adequate training matrix with the help of the Gaussian classifier and combined features. This work is still in progress, and we did not have the results when this report went to reproduction. But it seems that it is a computationally expensive way of working.
Discussion

Because this is "free" research (this job has not been done for a precise commercial purpose), there cannot be answers such as "this classifier is the best we can use". The Gaussian classifier and k-Nearest-Neighbors are the quickest algorithms and actually give the best rates of good classification. The Neural Networks method must be studied by experts: this way of classification is a very complex one, and it may be necessary to make this study an entire research topic. But we are not to forget the Decision Tree method: it is true that our results are less performant than the Gaussian and kNN ones, added to the fact that it is computationally more expensive. Anyway, our work was just a preliminary to a more structured and specific possible project. Some possible evolutions have already been given and can be completed as future work. Moreover, this kind of classification is worth concentrating on because of its specificity: features are selected by calculating their information gain, thus features with low relevancy are almost "forgotten". It classifies not only the signals but also the features themselves, which has not been done by the other classifiers.
CONCLUSION

Our duty was to present a wide-open panel of speech/music discriminators, and so we did. We conducted a large study on feature extraction in a first step, and then went through several kinds of classifiers.

Reaching a rate of more than 95% of good classification reveals that good discriminators can be implemented further. The Spectral Centroid and this particular feature, the Cepstrum, seem to be very relevant features (the kNN classifier gives independent rates of good classification as high as 85,53% and 81,28%, respectively). The Spectral Flux is not to be forgotten, with 71,06% of good classification with the Gaussian classifier and 79,57% with the kNN classifier. Classifiers are not to be easily improved: the Gaussian and kNN classifiers give the highest rates and are very computationally cheap. Neural Networks are to be given up because of too much complexity. But the Decision Tree classifier is capable of better results than given in this report. As we already said, it is worth being studied, because this classifier also improves the "purity" of the features with the information gain calculation. Here is the future work to be done.

Actually, it appears that much work is still to be done. Surely, classifiers themselves have to be perfected, but this is the usual way of proceeding, and many independent and parallel works are conducted at this time. Actually, the main part of the future work will concern the feature extraction as well: the 4 Hz modulation characterizes well the patterns of speech, or at least of human beings' voices, but it is still not a categorical enough feature to allow the exact classification of the signals. We have to increase the volume of such attributes. We have to ensure that these features are easy to extract and fast to calculate, looking at the main part of the possible commercial applications. Speech processing is a main topic of interest in the audio processing field, and the rise of telecommunication currently allows money to be distributed and projects to be conducted. It may be possible that new features will be implemented as the cepstrum was in its time, but let us not forget that the more relevant features are those on which many precise and deep studies were conducted.
GLOSSARY

ARG. Audio Research Group
AVQ. Adaptive Vector Quantization
DAMA. Content on Audio and Music Analysis
DAT. Digital Audio Tape
DMI. Digital Media Institute
DSP. Digital Signal Processing
DT. Decision Tree
EER. Equal Error Rate
HMM. Hidden Markov Model
ICASSP. International Conference on Acoustics, Speech and Signal Processing
IEEE. Institute of Electrical and Electronics Engineers
kNN. k-Nearest-Neighbors
LVQ. Learning Vector Quantization
NN. Neural Networks
RMS. Root Mean Square
SPL. Signal Processing Laboratory
TUT. Tampere University of Technology
VQ. Vector Quantization
ZCR. Zero-Crossing Rate
Appendices