Speech/Music Discriminator
I wish to express my sincere gratitude to those who have helped me make this six-month work possible.

A special thanks to Mr. Ioan TABUS, head of the international programs, for having welcomed me in the Signal Processing Laboratory. Thanks also for his advice and nice cooperation in my work.

I am very grateful to Anssi KLAPURI, my thesis advisor, for his guidance and continuous support, and also for his patience and his good mood. I thank him also for his useful comments on this thesis.

Thanks to all the members of the Audio Research Group and the Signal Processing Laboratory for providing such a good atmosphere, and to all the people I met in Tampere University and with whom I spent a nice time.

Thanks also to Mr. REFREGIER and to Mr. KAKOURIDIS who permitted this exchange program between my school and the Signal Processing Laboratory.
TAMPERE UNIVERSITY OF TECHNOLOGY
http://www.tut.fi

Tampere University of Technology provides high-level education in technical science and architecture, engages in research and product development, and furthers technical and scientific development in Finland.

In addition, TUT serves public institutions as well as industry by means of shared projects, commissioned research and diploma projects.

TUT offers three one-year International University Programs, whose language of instruction is English. The programs are in the following fields: Biomedical Engineering, Digital Signal Processing, and Material Science. English is due to be introduced in Architecture, Environmental Engineering and Mechanical Engineering as the language of tuition. Approximately 200 international students attend Tampere University of Technology.

TUT offers programs covering a wide spectrum of technological knowledge and skills. The several departments are:
· Department of Architecture
· Department of Mechanical Engineering
· Department of Electrical Engineering
· Department of Civil Engineering
· Department of Material Science
· Department of Information Technology
· Automation Department
· Department of Industrial Engineering and Management
· Environmental Department
DEPARTMENT OF INFORMATION TECHNOLOGY
http://www.cs.tut.fi

· General Information
Data processing - the combination of computer and information technology - has made information technology into a discrete whole, which is required as a separate area of expertise in technology and as a provider of indispensable basic skills for other areas in technology. This development is reflected by the differentiation in 1993 of the Faculty of Information Technology from the Faculty of Electrical Engineering.

· Mathematics
In addition to teaching, the Institute of Mathematics is actively engaged in research on coding theory, hypermedia, mathematical control theory, fuzzy mathematics, optimization, computability theory, and other areas of mathematics.
Computer-aided hypermedia based teaching methods are being developed at the institute. This hypermedia teaching can be reached at a variety of levels and facilitates studying significantly.

· Signal Processing
The Signal Processing Laboratory is the leading research unit in Finland in the field of signal and image processing, offering teaching of the highest order in many subareas of information technology.
The most central research areas are signal processing algorithms and applications, digital image-processing methods, speech coding and processing, digital audio, digital processing circuit design, and VLSI implementation. The institute is participating in the EU research projects RACE, COST, EUREKA, and ESPRIT.

· Software Systems
The importance of software systems is based on the central role of software engineering in the applications of information technology and in product development. Central areas of strength in teaching and research are modern software methods and tools, design of embedded and distributed
systems, formal methods, and computer networks. Activities, which have been developed to serve domestic industry, have achieved international recognition.

· Telecommunications
Activities focus on the central area in telecommunications of the 1990's: broadband telecommunication networks, digital transmission techniques, and the new digital mobile communication systems. Together with several other institutes of the University, an extensive research project on broadband network technology has been set up. This FASTER project includes the construction of an ATM test network, research, and industrial projects in the field of ATM engineering.
THE SIGNAL PROCESSING LABORATORY
http://sigwww.dmi.tut.fi

· Introduction
The Signal Processing Laboratory at Tampere University of Technology was inaugurated in 1988. The Signal Processing Laboratory belongs to the Department of Information Technology, consisting of five laboratories.
The laboratory can be roughly divided into Digital Signal Processing and Computer Systems parts. The most important change in 1996 was the modernization of the curriculum of both parts to better reflect the current trends in research and industry. In Signal Processing both research and teaching can be divided into the following four subtopics: digital filtering, image and video signal processing, audio and speech, and multimedia. Those majoring in signal processing now select one (or more) of these topics. The coverage of all important areas of signal processing has been made possible by founding two posts of Associate Professors. After founding a new professorship in Computer Systems from the beginning of 1996, both the research and teaching in this area have been expanded and modernized. This trend has been strengthened by founding a part-time professorship and an associate professorship.
Numerous researchers are working in both applied and basic digital signal processing research projects, which include multimedia, digital video, digital audio, biomedical signal processing, signal processing for telecommunications and VLSI system design. The laboratory co-operates closely with industry and international research groups. The laboratory personnel comprises about 120 staff members.
Lectures in digital signal processing and basic computer systems are given to all university students. After their basic studies, students specialize in either or both Digital Signal Processing or
Computer Engineering. Several postgraduate level courses are also given in the above subjects. In 1989, an International Program in Digital Signal Processing was started. Courses in this program are given in English. Some 20 foreign and Finnish students are selected to the one-year program each year. The program has been a great success, and it was extended to a master's degree program in 1992.
In 1996, the laboratory personnel participated in a large number of international activities including the organization of international conferences. The number of international journal and conference papers presented was 78 in total. During 1996, international cooperation also developed positively, with participation now in all major European programs: ESPRIT, COST, RACE, COMETT, EUREKA, ERASMUS, and TEMPUS. In early 1997, Tampere International Center for Signal Processing was founded in order to enhance the international visibility and contacts of the laboratory.

· Educational Activities
The Signal Processing Laboratory is a department of Tampere University of Technology. Consequently, an important part of its activities is in teaching M.Sc. and Ph.D. level students.
The Laboratory has two major teaching areas. One is, naturally, signal processing, and the other is digital systems, including digital circuit design, computer architectures and integrated circuits.
One speciality of the Laboratory is an International University Program in Digital Signal Processing. This is a one-year course package in the field of signal processing offered to international M.Sc. and Ph.D. level students. The teaching language is English. Students holding a B.Sc. degree or equivalent in the area of electrical engineering have the possibility to continue their studies in the Laboratory up to an M.Sc. and further to Dr.Tech. degrees.
The international program, supporting student and teacher mobility, offers important possibilities for students coming to Finland or Finnish students going abroad. The Signal Processing Laboratory is actively participating, for instance, in the European Community TEMPUS and ERASMUS programs.
· Research Activities
The main research areas are the following:

Speech and Audio
· Enhancement of Speech in Harmonic Noise
· Collection of a Speech Database
· Audio Coding
· Angle of Arrival Estimation Using Small Sensor Arrays
· Audio Content Analysis
Video Signal Processing
· Multibase Transform Coding
· Disparity Field Estimation for Video Sequences
· Motion Vector Improvement in Block Matching Algorithms
Nonlinear Signal Processing
· Filter Structures
· Adaptive Algorithms
· Optimization Methods
· Implementations
· Document Imaging
Transforms and Spectral Techniques
· Design
· Spectral Methods in Nonlinear Filtering
· Spectral Methods in Classification
Image Processing
· Lossless Image Compression Using Predictive Boolean and Stack Filters
· Image Segmentation
· Image Component Labeling
· Color Image Processing with B-Spline Modeling
· Nonlinear Color Image Restoration
Interactive Multimedia Technology
· Broadband Multimedia Systems
· Multimedia User Interfaces
· Digital Video Quality
· VRML (Virtual Reality Modelling Language)
· Tools for Information Retrieval and Organization
· Integrated Publishing in Multimedia Networks
VLSI Circuit Design for Signal Processing
· DSP Processor Design
· Logarithmic Arithmetic Design
· DSP Core Design
Parallel and Multimedia Systems
· Parallel Neurocomputer Systems
· Multimedia Enabling Technology
Embedded Systems Design Methodology
· System Modelling
· Formal Specification
· HW/SW co-design
· Digital Signal Processor Implementations
Learning Systems
· Neural Networks
· Speech Coding
· Tracking algorithms
Medical Information Processing Systems
· Computerized Vigilance Analysis
· EEG Analysis with Neural Networks
· Medical Device Communication Standardization
· Processing of Medical Isotopic Images
· 3-D Visualization of Anatomic Data
Design and Implementation of Digital Filters and Filter Banks for Various Applications
· Synthesis of IIR Filters
· Synthesis of Computationally Efficient FIR Filters
· Optimization of Digital Structures for VLSI Implementation
· Design of Decimators and Interpolators
· Synthesis of Wavelet Filter Banks
· Filters for Telecommunication Applications
· Other Applications
ABOUT SPEECH AND AUDIO
http://www.cs.tut.fi/sgn/arg

Research related to speech and audio in the Signal Processing Laboratory is conducted by the Audio Research Group. This group was refounded in 1996 and is composed of 20 people working in speech and audio projects. Research topics include both basic research and more application-oriented research which is financed by industrial companies, the European Union, the Ministry of Finland and the Finnish Academy. Below is a short description of the main projects of 1997:

· Speech
Collection of Finnish speech databases for speech recognition development (SpeechDat and SpeechDat-Car)
Finnish speech recognition
Wideband speech coding

· Music
Music content analysis
Automatic transcription of music

· Spatial Analysis
Angle of arrival estimation for pulsed sound waves
Blind deconvolution for speech/audio signals
· Audio Coding
Investigation of possibilities in using various transforms in audio coding
Research on advanced quantization, coding, and bit allocation methods
Studies of advanced utilization of psychoacoustical phenomena
· Non-Linearity in Audio
Construction of a subwoofer controlled with a digital signal processor
Research of a digital power amplifier
ABSTRACT

TAMPERE UNIVERSITY OF TECHNOLOGY
Department of Information Technology
Signal Processing Laboratory
PHILIBERT, ARNAUD B.S.: Speech / Music Discriminator
Engineering Project, 32 pages
Examiners: Prof. Ioan TABUS, Bernard GILLY
March - August 1999
Keywords: speech, music, signal classification, speech / music discrimination, audio content analysis, speech processing

The study concerns the automatic classification of audio signals into two classes: speech and music. According to signal processing practice, this requires extraction, evaluation and selection of features, and then selection of a suitable classification method.
These two subproblems were studied and are reviewed. Emphasis of this study is laid on two points: on literature review and on algorithm development.
A literature review was conducted on music and audio classification and several related areas of interest. An appropriate decomposition of the problem and the selection of an approach is first considered. Then the state of the art of the research is presented and discussed, and promising directions for further work are indicated.
The main and maybe the most interesting part of this study concerns the development of algorithms, first to calculate the features used and then to conceptualize a classifier. The performance of the method is evaluated by applying it on a database created in the course of this project, based on various kinds of music and different languages. All the algorithms were implemented and simulated in the Matlab environment.
INTRODUCTION

Due to the increase in volume of the telecommunication field (the European proportion of people owning a mobile phone averages 30%), a lot of small firms appeared to supply the huge multinational enterprises. They evolve as satellites around the leaders of telecommunication systems, concentrating their effort on bringing solutions to them. University laboratories did not forget to take advantage of the situation, and the Department of Information Sciences underwent a restructuring to allow a more competitive profile, facing the Finnish giant builder of wireless telecommunication systems: NOKIA. The Signal Processing Laboratory was created in this way so that it could generate some wide-open research studies. The advantage of such a laboratory compared to companies' research centers is that it is usually almost "free" concerning the choice of its research topics. Thus all the aspects of a problem are studied. And this is the way my project was conducted: we were not to build one discriminator, the best one, the fastest one or the cheapest one, but to present a panel of possible classifiers.
Not being the only group interested in such a topic, we first drew a roadmap of the previous studies, collecting all kinds of information: features, parameters, classifiers... It allowed us to verify some intuitions, correct some others, and list the features upon which we would work. We then began to write code extracting the features from an audio signal, in order to verify their relevance. This was done through a first set of samples. The second step was the classification itself: we needed some material to improve the code and we spent some time collecting audio signals of all sorts. Then started the several improvements of the classifiers. We selected four of them and conducted a quick study on each. The following pages respect this chronology.
Even if my work is quite similar to a Master of Science thesis, I did not forget my engineering background when I wrote this report. That means I did not enter too far into the details. I specify the main decisions, explain the choice of the parameters, present the main results, and I just indicate where more information can be found. The review notes provided here are intended to include just enough material to render the work comprehensible.
LITERATURE REVIEW

A literature review was conducted on speech/music discrimination and related areas of interest. These are speech features, music features, feature extraction, and temporal segmentation. The meanings of these terms were given in the introduction.
In this chapter, we first describe the methods that were used in the search of literature. Then we briefly describe the state of the art and history of designing speech / music discriminators or classifiers. Finally, we summarize and discuss the most important references we found in the abovementioned research areas.

Methods
The research in the Signal Processing Laboratory and especially the Audio Research Group includes audio content analysis for several different purposes. Thus the main part of the literature was already selected. In addition to that, a major facilitating observation was that most researchers give their publication indices on-line on the World Wide Web. Finally, some laboratories working in the same areas of interest usually send their papers to other researchers, so the information spreads quickly and can be stored.
After becoming familiar with the relevant topics, we started searching for publications in the abovementioned ways. We collected many references, then we looked for the papers in the different databases such as IEEE, seminars such as ICASSP, and the web sites of the main institutes and universities. We also took a look at some patents written on the subject.
Our scope was efficient enough but limited by several factors. There are sometimes too many resources on a single point of interest and it can be hard to select what would be useful. The other problem is timing: to surf the web and through the IEEE papers takes time, and the most relevant papers were those which some other researchers recommended to us.
Related work
The Signal Processing Laboratory is obviously not the only institute working on a speech / music discriminator. Many researchers, and not the least, have led studies on audio classification and feature extraction. Eric Scheirer's discriminator1 and the one of Saunders2 are among the most reliable classifiers. The first one, which is the most related to our work, reports on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. This will be discussed later, but we can already mention that he managed to implement a classifier with only 5.8% error on a frame-by-frame basis and 1.4% error when integrating long segments of sound3.
Some other research centers we cannot list here are currently working on the subject, and it was useful to read their results.

Roadmap to the most important references
Below we list the most important papers, from our point of view, written on the subject. They allowed us to determine which features could be used, and catalogued the difficulties already met. For each article, we summarized the contents and listed the main features.
1. "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney, in the Proceedings of the 1997 International Conference on Acoustics, Speech and Signal Processing, Munich, April 21-24, 1997.
2. "Real-Time Discrimination of Broadcast Speech / Music" by John Saunders, A Lockheed Martin Co., Nashua, NH 03061.
3. The frames were 2.4 seconds long.
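The footnoted error rates illustrate how integrating per-frame decisions over long segments reduces the error. A minimal sketch of such an integration step, assuming per-frame class labels are already available (the function name and the simple majority-vote rule are illustrative, not taken from Scheirer's paper):

```python
import numpy as np

def integrate_frames(frame_labels, segment_len):
    """Majority-vote the per-frame labels inside each segment.

    frame_labels : 1-D array of 0 (speech) / 1 (music) frame decisions
    segment_len  : number of frames pooled into one segment decision
    """
    n_segments = len(frame_labels) // segment_len
    decisions = []
    for i in range(n_segments):
        chunk = frame_labels[i * segment_len:(i + 1) * segment_len]
        # the segment is labelled music if more than half of its frames are
        decisions.append(int(np.mean(chunk) > 0.5))
    return decisions

# toy example: 10 noisy frame decisions pooled into 2 segments
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0])
print(integrate_frames(labels, 5))  # -> [1, 0]
```

Pooling averages out isolated frame errors, which is why the segment-level error can be several times lower than the frame-level error.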
[Figure: sources of literature - IEEE, ICASSP, World Wide Web, other seminars, patents]
"A Comparison of Features For Speech, Music Discrimination" by M.J. Carey, E.S. Parris and H. Lloyd-Thomas.
The authors begin with a summary of the main approaches previously taken to the problem of discriminating between speech and music signals. They list all the features investigated by Saunders, Zue and Spina, Scheirer and Slaney. They insist on the fact that none of the works so far has used pitch and amplitude features explicitly, and thus they concentrate on them. They carried out experiments on amplitude, cepstra, pitch and zero-crossings. In each case, the derivative of the feature was also used. Through the database, the distributions in the feature space were modelled by a Gaussian Mixture Model. The database was composed of more than a thousand 10 s samples, band-limited to 4 kHz and sampled at an 8 kHz rate, and the frames are 10 ms long. The database assembled for these tests includes speech from thirteen languages and music from all over the world.
The best performance resulted from using the cepstra and delta cepstra, which gave an Equal Error Rate (EER) of 1.2%. This was closely followed by normalized amplitude and delta amplitude. Pitch and delta pitch gave an EER of 4%, which was better than the zero-crossing, which produced an EER of 6%.

"Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney.
This is one of the most robust discriminators at the moment. It examined 13 features: 5 are variance features (variances of Rolloff Point, Spectral Centroid, Spectral Flux, Zero-Crossing Rate, and Cepstral Residual Magnitude). Scheirer explains that if a feature has the property that it gives very different values for voiced and unvoiced speech, but remains relatively constant within a window of musical sound, then the variance of that feature will be a better discriminator than the feature itself. Other features include Pitch estimation and Low-Energy. Scheirer decided to use a Multidimensional Gaussian Maximum estimator, one based partially on Decision Trees, and another based on k-Nearest-Neighbors. The database is composed of 80 15-second-long audio samples recorded at a 22.05 kHz sampling rate from an FM tuner.
He obtains some all-features classifications with an error rate around 6.8 +/- 3.5%. It may be able to discriminate voiced from unvoiced speech.
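Scheirer's variance argument can be illustrated with a small sketch: compute a frame-wise feature (here the zero-crossing rate) and take its variance over windows of frames. A feature that jumps between voiced-like and unvoiced-like frames shows a much larger variance than the same feature computed on a steady musical tone. All names, frame sizes, and toy signals below are illustrative choices, not values from the paper:

```python
import numpy as np

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame; frames has shape (n_frames, frame_len)."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def feature_variance(feature, window):
    """Variance of a per-frame feature over consecutive windows of frames."""
    n = len(feature) // window
    return np.array([np.var(feature[i * window:(i + 1) * window]) for i in range(n)])

# toy signals: frames alternating noise/tone (speech-like) vs. a steady tone
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
speechlike = np.stack([rng.standard_normal(256) if i % 2 else np.sin(2 * np.pi * 200 * t)
                       for i in range(20)])
musiclike = np.stack([np.sin(2 * np.pi * 440 * t) for _ in range(20)])

var_speech = feature_variance(zero_crossing_rate(speechlike), 10)
var_music = feature_variance(zero_crossing_rate(musiclike), 10)
print(var_speech.mean() > var_music.mean())  # speech-like ZCR varies far more
```

The raw ZCR values of the two classes can overlap, but the variance of the ZCR separates them cleanly, which is exactly the property Scheirer exploits.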
In another article1, Eric Scheirer presents a method for using bandpass filters and banks of parallel comb filters to analyse the tempo and extract the beat from music signals. Scheirer is only concerned with beat and tempo and avoids what is relative to rhythm. He manages to demonstrate that the performance of his algorithm is similar to the performance of human listeners. This algorithm is based on the one implemented by Large and Kolen, in that it uses a network of resonators to phase-lock with the beat of the signal and determine the frequency of the pulse.

"Real-Time Discrimination of Broadcast Speech/Music" by John Saunders (1996).
This article contains details of techniques that Scheirer borrowed but did not explain. The purpose was to discriminate speech from music on broadcast FM radio. Thus, it requires minimal computation. Saunders explained a few features which can be used, such as tonality, bandwidth, excitation patterns, tonal duration and energy sequences. But the main features investigated by Saunders are energy and zero-crossing rate. A sample rate of 16 kHz was chosen for this discrimination technique. The database was composed of blocks of 2.4 seconds containing 150 frames, from an FM tuner. A Multivariate-Gaussian classifier was used.
Experimental results show performance approaching 98% correct classification.

"A Fast Audio Classification from MPEG Coded Data" by KDD R&D Labs and University of Electro-Communications.
Two points differentiate this algorithm from the others. First, it is based on silent segment detection. Remaining segments are then classified into three types: music, speech and applause. Secondly, it uses a Bayes discriminant function for a multivariate Gaussian distribution instead of manually adjusting a threshold for each discriminator. This gives a 90% rate of good classification.
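The Bayes discriminant function mentioned above can be sketched for the two-class case: fit one multivariate Gaussian per class on training feature vectors and pick the class with the larger log-likelihood, which replaces any hand-tuned threshold. This is a generic illustration under equal class priors, not the authors' exact formulation; the class and variable names are invented:

```python
import numpy as np

class GaussianDiscriminant:
    """Two-class Bayes classifier with one multivariate Gaussian per class."""

    def fit(self, X0, X1):
        # per-class mean vector, inverse covariance, and log-determinant
        self.params = []
        for X in (X0, X1):
            mu = X.mean(axis=0)
            cov = np.cov(X, rowvar=False)
            self.params.append((mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))
        return self

    def predict(self, x):
        # choose the class maximising the Gaussian log-likelihood
        scores = []
        for mu, icov, logdet in self.params:
            d = x - mu
            scores.append(-0.5 * (d @ icov @ d + logdet))
        return int(np.argmax(scores))

# toy 2-D feature vectors: class 0 around (0, 0), class 1 around (3, 3)
rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 2))
X1 = rng.standard_normal((200, 2)) + 3.0
clf = GaussianDiscriminant().fit(X0, X1)
print(clf.predict(np.array([0.1, -0.2])), clf.predict(np.array([2.9, 3.1])))
```

With unequal priors one would add a log-prior term to each score; the decision rule otherwise stays the same.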
"Hierarchical Classification of Audio Data for Archiving and Retrieving" by Tong Zhang and Jay Kuo.
The authors implemented a system able to classify and retrieve audio signals. First, speech, music, environmental sounds and silence are discriminated. Second, environmental signals are
1. "Tempo and Beat Analysis of Acoustic Musical Signals" by Eric D. Scheirer (1996)
further classified into subclasses using time-frequency analysis and Hidden Markov Model likelihood (a model was built for each class). In the third phase, the system processes a query-by-example audio retrieval by computing the various probabilities. There are two main classes of features: physical features such as Energy, Zero-Crossing Rate and Fundamental Frequency, and perceptual features (Timbre and Rhythm).
Preliminary experiments showed that an accuracy rate of over 80% can be achieved with the proposed fine classification method.

"Recognition of Music Types" by H. Soltau, T. Schultz, M. Westphal, A. Waibel from Interactive Systems Laboratories.
This paper focuses on music classification. No speech signals are considered, but it is useful to see how we can discriminate music types through the Hidden Markov Models method (HMM) and the Explicit Time Modelling with Neural Network method (ETM-NN). The purpose is to index and search music tapes in multimedia databases.
First, signals need to be preprocessed in order to extract the cepstral feature, then launch an HMM analysis.
The following table lists all the features used by the several discriminators and records how often each was used and by which researcher (referenced in the following table).

TABLE 1. References of the articles

reference article
1 "A Comparison of Features For Speech, Music Discrimination" by M.J. Carey, E.S. Parris and H. Lloyd-Thomas.
2 "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" by Eric Scheirer and Malcolm Slaney.
3 "Tempo and Beat Analysis of Acoustic Musical Signals" by Eric D. Scheirer (1996).
4 "Real-Time Discrimination of Broadcast Speech/Music" by John Saunders (1996).
5 "A Fast Audio Classification from MPEG Coded Data" by KDD R&D Labs and University of Electro-Communications.
6 "Hierarchical Classification of Audio Data for Archiving and Retrieving" by Tong Zhang and Jay Kuo.
7 "Recognition of Music Types" by H. Soltau, T. Schultz, M. Westphal, A. Waibel from Interactive Systems Laboratories.
TABLE 2. Features used in the Discriminators
1 2 3 4 5 6 7
Amplitude ● ●
Delta Amplitude ●
Pitch ● ●
Delta Pitch ●
Four-Hertz Modulation Energy ●
Roll-Off of the Spectrum ●
Variance of the Roll-Off feature ●
Spectral Centroid ● ● ●
Variance of the Spectral Centroid ● ●
Spectral Flux ●
Variance of the Spectral Flux ●
Cepstral Residual ● ●
Variance of the Cepstral Residual ● ●
Cepstral feature ●
Pulse Metric ●
Zero-Crossing Rate ● ● ● ●
Standard Deviation of first order difference of ZCR
● ● ●
Third Central Moment about the Mean of ZCR
●
Total number of zero-crossings exceeding a threshold
●
Difference between zero-crossing samples above and below the mean
●
Number of Energy minima below a relative threshold
●
Number of silent segments ●
Bandwidth of signal ●
hidden unit (HMM) ●
The features can be ordered in the following way1:
Signal "brightness" features
- Centroid and its variance
- Zero-Crossing Rate and its variance
- Roll-Off point
- Fundamental Frequency
Spectral Shape features
- Cepstrum
- Cepstral Delta
TABLE 3. Cross-references between researchers
Vertical refers to horizontal
Scheirer & Slaney / Saunders / Carey, Parris & Lloyd-Thomas / KDD R&D Labs / Soltau & Schultz
Scheirer & Slaney ● ● ●
Saunders ● ● ● ●
Zue & Spina ●
Povel & Essens ●
Desain & Honig ●
Large & Kolen ●
Longuet-Higgens & Lee ●
Rosenthal ●
Brown ●
Todd ●
Kedem ● ● ●
Bogert ●
1. This is just a preliminary characterization of the features. The full definitions will be given later.
Harmonic/Noisy features
- Cepstral Residual
- Pitch (relative to the tone "color" of the signal)

Amplitude Envelope features
- 4 Hz modulation
- Pulse metric
- Rhythm
- Energy
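Two of the features catalogued above, the spectral centroid (a "brightness" measure) and the roll-off point, can be sketched directly from their definitions. The frame length, sampling rate, and the 95% roll-off fraction below are illustrative choices, not values prescribed by the surveyed papers:

```python
import numpy as np

def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency of one frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * mag) / np.sum(mag)

def rolloff_point(frame, fs, fraction=0.95):
    """Frequency below which `fraction` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

# a pure 1 kHz tone: both measures sit at (or very near) 1 kHz
fs = 8000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 1000 * t)
print(spectral_centroid(frame, fs), rolloff_point(frame, fs))
```

Both features rise when spectral energy shifts toward high frequencies, which is why they are grouped as "brightness" measures.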
Commercial products
Nowadays, none of the discriminators has yet been applied in a commercial system. But we already know that one of the applications could be on broadcast FM radio. The computational simplicity of the approach could lend itself to wide application, including the ability to automatically change channels when commercials appear. And anyway, the feature extraction may be applied to speech and music recognition and thus be widely used.
DECOMPOSITION OF THE CLASSIFICATION PROBLEM

Before using any classifier algorithm, we had to think about the criteria of discrimination. The previous chapters and the several articles on the subject indicated there are many features we can ever use, but depending on the computing environment and the aims of the application systems, we had to select several of the features. The previous work on the subject helped to catalogue the criteria, but we first had to evaluate them ourselves and make a decision concerning their efficiency. So we coded algorithms to evaluate each feature, and drew diagrams and calculated figures to help in our decision.
Because it is "free" research, we had no specific constraint but the Matlab computing environment. So we decided to collect as many features as we could. Thus the classifier algorithm, whatever it would be, would perform as well as the number of features allows. The only constraints were a reasonable speed of computation (the main commercial applications seem to be real-time systems) and a low memory requirement during the computation (the Matlab simulator is not really efficient in this respect).
This had to be completed with the building of an efficient database. The more material we have, the better it is for the classification improvement. We needed this first to improve our feature extractors, and then to develop the various classifiers: the way to do so is to build the classifiers by determining the parameters with training data, and verifying the efficiency with other data. Other teams working on audio classification average several hundreds of signals, and some reach several thousands. An important database of music signals was already stored in the laboratory's archives, but it was lacking enough speech samples. We thus completed the database with speech records, collected through the laboratory staff. The advantage of this staff is that it comprises many different nationalities, so that we recorded various languages and accents.
Eventually comes the classification by itself. We had a panel of available classifiers. There was also the constraint of the Matlab computing environment, which forced us to use some classifiers which did not need so much memory. At least, it was a "good" constraint with regard to the future commercial applications, which require fast computation. So our choice was almost unlimited and we tried to improve the most current or famous ones.
[Figure: project workflow - related works, feature improving, material improving (SpeechDat recording, ARG files), classification]
CHARACTERISATION OF AUDIO SIGNAL: FEATURE EXTRACTION AND TEMPORAL SEGMENTATION

As we noticed in the previous review of the related studies, several approaches have been studied, and each of these used different features and pattern classification techniques, and described results on different material. In this report we study the relevancy of several of the different features previously suggested, and especially of some we believe have useful properties. We carried out tests on the same data. In this way we attempted to provide a comparative view of the value of the different types of features in speech/music discrimination.

Mathematical definition of the main features and improving

We first started by using 15 music samples and 15 speech samples. We calculated the feature values for each signal, plotted the results (standard deviation and mean), and determined the mean and the mean of the variance of each feature, first for all music samples, then for speeches. In the same way, we also give the variance of the mean of the features, first for all music samples, then for speeches. Later we refined with a larger database.

Cepstrum. As a preliminary view, let us say the cepstrum measures the spectrum shape or the sound "color". The term cepstrum was introduced by Bogert et al.1 and has come to be accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal. They observed that the logarithm of the power spectrum of a signal containing an echo has an additive periodic component due to the echo, and thus the Fourier transform of the logarithm of the power spectrum should exhibit a peak at the echo delay. They called this function the cepstrum, interchanging letters in the word spectrum because "in general, we find ourselves operating on the frequency side in ways customary on the time side and vice versa." Bogert et al. went on to define an extensive vocabulary to describe this new signal processing technique; however, only the term cepstrum has been widely used.

1. B. P. Bogert, M. J. R. Healy, and J. W. Tukey. The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-autocovariance, Cross-Cepstrum, and Saphe Cracking, pages 209-243. John Wiley and Sons, New York, 1963.
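Bogert et al.'s observation can be illustrated numerically: the cepstrum of a signal containing an echo shows a peak at the echo delay. A minimal numpy sketch (ours, not the thesis' Matlab code; the function name and the 0.8 echo gain are illustrative):

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log power spectrum, per Bogert et al.'s definition."""
    spectrum = np.fft.fft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # small floor avoids log(0)
    return np.real(np.fft.ifft(log_power))

# A noise signal with an added echo: the cepstrum peaks at the echo delay.
n = 512
rng = np.random.default_rng(0)
s = rng.standard_normal(n)
delay = 64
echoed = s + 0.8 * np.roll(s, delay)  # circular echo, delay in samples

c = real_cepstrum(echoed)
# Ignore the low-quefrency region, which is dominated by the spectral envelope.
peak = int(np.argmax(c[16 : n // 2])) + 16
```

With these parameters `peak` recovers the 64-sample echo delay, which is exactly the property the feature exploits.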
The cepstrum residual is the margin between the smoothing of the spectrum and the original spectrum. It is calculated as follows: the value of this feature is determined by first computing the cepstrum of the spectrogram by means of a Discrete Fourier Transform. The result is then smoothed over a time window, and the sound is resynthesized. The smooth spectrum is then compared to the original (unsmoothed) spectrum to obtain an error value. Usually the error value is higher for voiced speech and music. The cepstrum resynthesis residual magnitude is computed as follows:

$$CR \equiv \sum_k \left( X[k] - Y[k] \right)^2$$

Spectral Centroid. It represents the balancing point of the spectral power distribution within a frame. Many types of music involve percussive sounds which, by including high-frequency noise, result in a higher spectral mean. In addition, the excitation energies can be higher for music than for speech, in which pitch stays within a range of fairly low values. As a result, the spectral centroid for music is, on average, higher than for speech. In addition, the spectral centroid has higher values for unvoiced speech than it does for voiced speech. The spectral centroid for a frame occurring at time t is computed as follows:

$$SC \equiv \frac{\sum_k k\,X[k]}{\sum_k X[k]}$$
where k is an index corresponding to a frequency, or small band of frequencies, within the overall measured spectrum, and X[k] is the power of the signal at the corresponding frequency band.

[Figure: two different examples of the Spectral Centroid, each represented by a ∆.]

Spectral Flux. This feature measures the frame-to-frame spectral difference; thus it characterizes the change in the shape of the spectrum. Speech has a higher rate of change, and goes through more drastic frame-to-frame changes than music. We can note that speech alternates periods of transition (consonant-vowel boundaries) and periods of relative stasis (vowels), where music typically has a more constant rate of change. As a result the spectral flux value is higher for speech, particularly unvoiced speech, than it is for music.

The Spectral Flux (also known as the Delta Spectrum Magnitude) is the 2-norm of the frame-to-frame spectral amplitude difference vector:

$$SF \equiv \left\lVert X_t[k] - X_{t+1}[k] \right\rVert_2$$

Zero-Crossing Rate. This value is the number of time-domain zero-voltage crossings within a speech frame. In essence, the zero-crossing rate indicates the dominant frequency during the time period of the frame. This is a correlate of the spectral centroid.
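These three per-frame measures can be sketched directly from their definitions. A minimal numpy illustration (our function names, not the thesis code), compared on a low-pitched and a high-pitched tone over one 100 ms frame:

```python
import numpy as np

def spectral_centroid(frame):
    """SC = sum(k * X[k]) / sum(X[k]) over the power-spectrum bins."""
    X = np.abs(np.fft.rfft(frame)) ** 2
    k = np.arange(len(X))
    return np.sum(k * X) / (np.sum(X) + 1e-12)

def spectral_flux(frame_a, frame_b):
    """2-norm of the frame-to-frame spectral magnitude difference vector."""
    A = np.abs(np.fft.rfft(frame_a))
    B = np.abs(np.fft.rfft(frame_b))
    return np.linalg.norm(A - B)

def zero_crossing_rate(frame):
    """Number of sign changes in the time-domain frame."""
    signs = np.sign(frame)
    return int(np.sum(signs[:-1] != signs[1:]))

fs = 8000
t = np.arange(800) / fs                 # one 100 ms frame, as in the thesis
low = np.sin(2 * np.pi * 200 * t)       # low-frequency tone
high = np.sin(2 * np.pi * 2000 * t)     # high-frequency tone
```

As expected, the high tone yields both a higher centroid and a higher zero-crossing rate, which is the correlation noted above.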
Rolloff Point. This value measures the frequency below which 95% of the power in the spectrum resides. Music, due to percussive sounds, attack transients, and the like, has more energy in the high frequency ranges than speech. As a result, the spectral roll-off point exhibits higher values for music and unvoiced speech, and lower values for voiced speech. This is a measure of the "skewness" of the spectral shape: the value is higher for right-skewed distributions. The spectral roll-off value for a frame is computed as follows:

$$SR = K, \quad \text{where} \quad \sum_{k < K} X[k] = 0.95 \sum_{k} X[k]$$

[Figure: two kinds of Rolloff Point of the spectrum (spectrum amplitude against frequency).]

Pulse metric. This is a feature which uses long-time band-passed autocorrelations to determine the amount of "rhythmicness" in usually a 5-second window. It does a good job of telling whether there is a strong, driving beat (i.e. techno, salsa, straight-ahead rock-and-roll) in the signal. It cannot detect rhythmic pulse in signals with rubato or other tempo changes.

Percentage of "Low-Energy" Frames. This value measures the proportion of frames with RMS power less than 50% of the mean RMS power within a one-second window. This is an amplitude feature. The energy distribution for speech is more left-skewed than for music (there are more quiet frames), so this measure will be higher for speech than for music.
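Both amplitude-distribution features above can be sketched in a few lines of numpy (illustrative code of ours; the "speech-like" and "music-like" test frames are synthetic stand-ins, not thesis data):

```python
import numpy as np

def rolloff_point(frame, fraction=0.95):
    """Smallest bin K such that the cumulative power below K reaches 95% of the total."""
    X = np.abs(np.fft.rfft(frame)) ** 2
    cumulative = np.cumsum(X)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]))

def low_energy_rate(frames):
    """Fraction of frames whose RMS power is below 50% of the mean RMS power."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms < 0.5 * np.mean(rms)))

fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 200 * t)                         # narrowband, low rolloff
noisy = tone + 0.5 * np.random.default_rng(1).standard_normal(len(t))  # broadband

# Speech-like amplitude pattern: loud bursts separated by near-silence;
# music-like: a steadier envelope.
speechlike = np.stack([tone, 0.01 * tone, tone, 0.01 * tone, tone])
musiclike = np.stack([tone, 0.9 * tone, tone, 0.9 * tone, tone])
```

The broadband frame pushes the rolloff bin far up the spectrum, and the bursty envelope yields the higher low-energy rate, matching the qualitative claims above.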
4 Hz modulation energy. Speech has a characteristic energy modulation peak around the 4 Hz syllabic rate. We use a filterbank (with logarithmic scale) to convert the audio signal into 8 perceptual channels. We extract the energy in each band, bandpass filter each channel with a second-order filter with a center frequency of 4 Hz, then calculate the short-term energy by squaring and smoothing the result. We normalize each channel's 4 Hz energy by the overall channel energy in the frame, and sum the result from all channels. Speech tends to have more modulation energy at 4 Hz than music does. Practically, we can usually notice a peak around the 4 Hz frequency on the spectrum if the signal is speech.

[Figure: left, the power spectrum of a music signal; right, that of a speech signal, with a peak at the 4 Hz frequency.]

This is a relevant feature but uneasy to extract computationally: the peak can be present but lost in the mass of frequencies in the spectrum. The best way would be to analyse the whole training set, watch the amplitude of the 4 Hz frequency, determine the mean for each class, and discriminate the two groups by thresholding or with a Gaussian classifier. We did not first complete the codes this way, in order to keep the computation fast. We limited ourselves to analyzing whether the 4 Hz peak was stronger than both the 3 Hz and 5 Hz ones.
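The simplified 3/4/5 Hz comparison described above can be sketched as follows. This is our illustration, not the thesis code: it uses a crude rectified envelope on the full band instead of the 8-channel filterbank, and a synthetic 4 Hz amplitude-modulated carrier as a speech stand-in:

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs  # 2 seconds of signal

# Speech-like: a 500 Hz carrier amplitude-modulated at the ~4 Hz syllabic rate.
speechlike = (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
# Music-like stand-in: the same carrier with a steady envelope.
musiclike = np.sin(2 * np.pi * 500 * t)

def modulation_bin_energy(signal, hz):
    """Magnitude of the amplitude envelope's spectrum at a given modulation frequency."""
    envelope = np.abs(signal)  # crude envelope (no perceptual filterbank here)
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    bin_index = int(round(hz * len(signal) / fs))
    return spectrum[bin_index]

def looks_like_speech(signal):
    """The thesis' simplified test: is 4 Hz stronger than its 3 and 5 Hz neighbours?"""
    e3, e4, e5 = (modulation_bin_energy(signal, h) for h in (3, 4, 5))
    return e4 > e3 and e4 > e5
```

On these synthetic signals the modulated carrier passes the test while the steady one carries essentially no 4 Hz envelope energy.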
Selection of the features

We have already said that the more features we use, the better it is when improving a classifier algorithm, so we were tempted to select all of the previous features. One constraint, and not the least, is the Matlab environment: the processes need much memory when running, and we can sometimes run out of it. Matlab is also an interpreted, line-by-line language, which means some routines are not feasible as they would be in the C language. The features thus excluded are all pitch estimation features. Let us also note that some classifiers will be confused by bad features (a good aspect of the Decision Tree classifier is that it is not confused this way).
We then decided to select the most relevant features in our point of view, as listed below:

· Spectral Centroid
· Spectral Flux
· Rolloff Point
· Residual Cepstrum
· Zero-Crossing Rate
· 4 Hz Modulation

In addition, we used the variance of these six features as well. Computing them, it was apparent that some results could be interesting (standard deviations are sometimes quite different for speech and music), and it easily increases the number of features.

Improving of the features

We needed to verify and confirm the relevancy of each feature before pledging to integrate them in the classifying process. So we wrote some codes calculating, storing and eventually evaluating the values of the various features and their variance on a collection of speeches and music samples, and plotted diagrams to visually estimate whether our choice was efficient.

Later in this report the audio material is described, but we can already state that sounds were studied frame-by-frame, each frame being 100 ms long. We thus determined the mean and standard deviation for each feature and each signal. The length of the frame was chosen regarding the various previous studies, and by averaging. It seems to be a reasonable choice, even if real-time systems at a commercial production level need faster, thus shorter, frames. Calculations can easily be conducted later without many modifications inside the codes.

The respective codes can be found in the appendix. They respect an identical programming structure: reading of the parameters of the records (sampling frequency, length, ...), storage of the signals frame-by-frame, filtering if needed, calculation of the mean value and variance value of the feature, and plotting of a two-dimension variance-to-mean diagram distinguishing the speeches from the music samples. These diagrams are also available in the appendix, but below is a quick review.

[Diagrams: red crosses are speeches; blue circles are music samples.]
Discussion. All the features we decided to integrate are relevant. Looking at the diagrams, at least one of the two dimensions is distinct for music and speech.

The plotting allowed us to suppress some of the signals in our material which were evidently unadapted (their plot was located far out of the usual range).

We can already notice that the Spectral Flux seems to be a good discriminator, in that the plots of each class are not really mixed, and that, looking at the centers of mass of the diagrams, the Spectral Centroid and the Zero-Crossing Rate are effectively correlated.

The Audio Material

We tried to cover the largest panel of audio signals. Even if we limited ourselves to the classification of the sounds into only two categories, music and speech, the material has to be representative of the audio range, and it can later be used for experiments in finer classification (female, male, rock, pop, jazz, ...).

A good collection of music samples was already stored on the lab computers, representing all kinds of music (classical, pop, techno, new age, rock, jazz). The singers were also considered as music1. But the Audio Research Group (ARG) was lacking speech signals, or they were not appropriate to our purpose: they were mainly short vowels or alphabetical letters, or speech recordings made through a phone line. So it was decided to increase the ARG material by creating some new. Using specific texts produced by the NOKIA Research Center, we recorded people's voices with a portable DAT recorder. The advantage of our laboratory in Tampere is that we usually meet many nationalities, so we collected various tongues and accents, as follows:

1. This was a trouble in that a singer is a human being and thus produces sounds with his body, so he respects the patterns of speech. Here there can be a problem of class definition. Regarding the possible commercial products, in the radio broadcast field, we defined the songs as music signals and processed the samples in such a way. We can also notice that songs are rarely "a capella", and then there exists a musical background.
Table 4: Various tongues recorded with the DAT recorder

English    French    Scottish
Estonian   Irish     Russian
Finnish    Chinese   Polish
Spanish    Japanese  Italian
Armenian
In spite of the real and sad lack of female researchers in our laboratory, we managed to record a little collection of female voices.

These records were then digitized with a sound editor as *.aif files, and then converted into wave files to be processed in the Matlab environment. This material is now available in the ARG's files. The signals are at least 2 seconds long and average a 10-second length. After selecting them, we worked on 101 music samples and 135 speech samples.
CLASSIFICATION

We explained previously that we wanted to choose the most efficient and common classification functions. The fact is that many classifiers are nowadays available, and it may sometimes be uneasy to make the best choice. Currently, the discriminant functions used by the various researchers are the Gaussian function, Linear Discriminant Analysis, k-Nearest-Neighbors, and Neural Networks. We implemented two of them (Gaussian and k-Nearest-Neighbors), attempted to process a Neural Network algorithm but failed (the Matlab software presented too much of a lack of memory), implemented another one based on Learning Vector Quantization, and worked on a new one: the Tree-structured classifier.

I do not intend to describe precisely the workings of these classification functions (much has been written on them), but below is an abstract of those we used. I advise the reader to get the books quoted in references1&2 if he wishes to get more information on the subject.

First of all, how to implement the classifiers

We know that the more samples we have in the training set, the more valuable are the parameters of the discriminant function and the more efficient is the classifier. So the idea is to use all the samples we have, but one, as the training set. This last one is thus used as an "unknown" signal to be classified. Then we iterate this process through all the material. That means that, as we own 101 music samples and 134 speech samples, we evaluate the parameters 235 times and classify as many signals. Then we verify how many music samples and speech samples are well classified, and calculate the rate of success.

As we previously stated, the signals are studied frame-by-frame and each frame is 100 ms long.
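The leave-one-out procedure described above can be sketched generically. This is our illustration, with a trivial nearest-class-mean stand-in for the real classifiers and synthetic two-feature data:

```python
import numpy as np

def leave_one_out_accuracy(features, labels, train_and_classify):
    """Hold each sample out in turn, train on the rest, classify the held-out one."""
    correct = 0
    n = len(labels)
    for i in range(n):
        keep = np.arange(n) != i
        predicted = train_and_classify(features[keep], labels[keep], features[i])
        correct += int(predicted == labels[i])
    return correct / n

def nearest_mean(train_x, train_y, x):
    """Toy classifier: pick the class whose feature mean is closest."""
    classes = np.unique(train_y)
    means = [train_x[train_y == c].mean(axis=0) for c in classes]
    dists = [np.linalg.norm(x - m) for m in means]
    return classes[int(np.argmin(dists))]

rng = np.random.default_rng(2)
music = rng.normal(0.0, 0.5, size=(20, 2))   # hypothetical music feature cloud
speech = rng.normal(3.0, 0.5, size=(20, 2))  # hypothetical speech feature cloud
X = np.vstack([music, speech])
y = np.array([0] * 20 + [1] * 20)

accuracy = leave_one_out_accuracy(X, y, nearest_mean)
```

Any of the classifiers below can be plugged in as `train_and_classify`; the thesis does exactly this with 235 samples.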
Gaussian discrimination

We used the multinormal Gaussian classification, assuming that each class was equally likely. Here is the probability of a vector r:

1. "Machine Learning", Tom M. Mitchell, The McGraw-Hill Companies, Inc.
2. "Pattern Recognition and Neural Networks", B. D. Ripley, University of Oxford, Cambridge University Press.
$$P(r \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(r-\mu)^2 / (2\sigma^2)}$$

where µ is the mean and σ the standard deviation. These parameters are first estimated by improving the function through a training matrix1. Then, to classify an unknown signal, we calculate the two probabilities of being music or speech and adopt the greater one.

This algorithm has a lot of advantages. First, it is simple to integrate: the codes2 are very easy to write (it concerns just one equation to solve), and the computing cost is quite low (not many loops). Regarding the possible commercial applications, with the constraint of real-time systems, it may be one of the best processes.

The obtained results confirm our feeling. The calculation time depends of course on the specifications of the Unix stations, but it already indicates that this is the fastest process we studied.

We used the whole set of features first. Below are the various rates of success.

Table 5: Rates of Good Classification with the Gaussian method

global rate   92,34%
music rate    94,06%
speech rate   91,05%

We then thought about the possibility of optimizing the algorithm using only the most relevant features. Consequently we launched the process independently, studying the relevancy of each feature one after the other. The results are tabled below.

1. A Training Matrix is an M*N matrix, with N known samples as columns and M features as rows.
2. Code available in appendix.
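The likelihood comparison above can be sketched as follows; assuming, as a simplification, independent features per class (the per-class means and standard deviations below are hypothetical, not the thesis' estimates):

```python
import numpy as np

def gaussian_pdf(r, mu, sigma):
    """P(r | mu, sigma) = exp(-(r - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return np.exp(-((r - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def classify(r, params):
    """Pick the class with the greater likelihood (equal priors, as in the thesis)."""
    likelihoods = {name: np.prod(gaussian_pdf(r, mu, sigma))  # independent features
                   for name, (mu, sigma) in params.items()}
    return max(likelihoods, key=likelihoods.get)

# Hypothetical per-class (mean, std) vectors estimated from a training matrix.
params = {
    "music":  (np.array([2.0, 5.0]), np.array([0.5, 1.0])),
    "speech": (np.array([4.0, 1.0]), np.array([0.5, 1.0])),
}
```

A feature vector near the music means is assigned to music, and symmetrically for speech; this is the "one equation to solve" cheapness noted above.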
We can verify that the 4 Hz modulation feature is not very relevant, as we estimated previously during the mathematical description.

Table 6: Rates of Good Classification with the Gaussian method, feature by feature

Feature               Music (%)   Speech (%)   Global Rate (%)
Centroid              90,1        53,73        69,36
Rolloff               94,06       21,64        52,77
Flux                  87,13       58,96        71,06
Cepstrum              70,3        59,7         64,26
Modulation 4 Hz       7,92        87,31        53,19
Zero-Crossing Rate    93,07       29,1         56,6

Regarding these values, we attempted to classify the signals using just a combination of the Spectral Flux and Spectral Centroid, or a combination of these two features added to the Cepstrum. The result is that neither combination produces better rates of classification. The more features we integrate, even ones with a low relevancy, the better is the classifier.

k-Nearest-Neighbors classification

The algorithm assumes all instances correspond to points in an n-dimensional space. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance x be described by the feature vector

(a1(x), a2(x), ..., an(x))

where ar(x) denotes the value of the rth attribute of instance x. Then the distance between two instances xi and xj is defined to be d(xi, xj), where

$$d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$
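The distance and the majority vote together give the whole algorithm; a minimal sketch (our code, with a toy six-sample training set):

```python
import numpy as np

def knn_classify(train_x, train_y, x, k=3):
    """Majority class among the k nearest training instances (Euclidean distance)."""
    d = np.sqrt(np.sum((train_x - x) ** 2, axis=1))
    nearest = train_y[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Two small, well-separated feature clouds (illustrative values only).
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
train_y = np.array(["music", "music", "music", "speech", "speech", "speech"])
```

With k=3, as retained in the thesis, an unknown point is labelled by the cluster it falls into; raising k too far makes distant instances from both classes outvote the local neighbourhood, which is the "drowning" effect discussed below.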
So the class of an unknown sample is the most common class value among the k nearest instances according to the above expression.

We first launched the process with the value 3 for the parameter k. The results are below:

Table 7: Rates of Good Classification with the kNN classifier

global rate   94,47%
music rate    89,11%
speech rate   98,51%

These are quite good rates of success, so we can consider the kNN method very compatible with speech/music discrimination.

Table 8: Rates of Good Classification with the kNN classifier, feature by feature

Feature               Music (%)   Speech (%)   Global Rate (%)
Centroid              78,22       91,4         85,53
Rolloff               56,44       84,33        72,34
Flux                  75,25       82,84        79,57
Cepstrum              72,28       88,06        81,28
Modulation 4 Hz       97,03       11,19        48,09
Zero-Crossing Rate    75,25       85,82        81,28

We attempted to increase these values by modifying the parameter k. We tried several values, but the case k=3 seems to be a good compromise: when we raise the value of the parameter, the classifier looks too far away from the plot under consideration and integrates too many instances from both classes. The "unknown" instance is then drowned in the whole stack.

Neural Network

This kind of process was very famous in the 80's, when artificial intelligence and the patterns of the brain's way of working were studied. The details of this process are very complex, so let us just say that the parameters of the algorithm are determined following the Perceptron Training Rule.
The various ways to go from inputs (in our case, samples) to outputs (music and speech classes) are multiple and weighted. So the learning problem is to determine a weight vector that causes the perceptron to produce the correct output value for each of the given training examples. There are several algorithms to determine these weights, and we will not describe them here.

[Figure: global aspect of the Neural Networks structure.]

We attempted to use Matlab's predefined functions1 to build such a discriminator. We quickly failed, because such functions are supposed to work on very simple cases, and it seems there are a lot of iterations during the Perceptron Training process (after a 48-hour delay, the processes were still running). We abandoned the use of Matlab functions and attempted another method.

Learning Vector Quantization2 & 3. Vector Quantization is a classical method in signal processing to produce an approximation to the distribution of a single class by a codebook (that is also called the training set). Each incoming signal is mapped to the nearest codebook vector, and that vector is sent instead of the original signal.

The major advantages of formulating the VQ problem in terms of neural networks are:

1. Neural Networks functions, Demo Application, Matlab 5.3.
2. "Possibilistic Clustering in Kohonen Networks for Vector Quantization" by D. Ghosh and A. P. Shivaprasad, Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India.
3. World Wide Web URL: http://www.ee.eng.ohio-state.edu
- Since neural networks have a parallel computing architecture, they may be used for real-time VQ.
- The various algorithms available for training neural networks may be used for self-organization of the codebook in VQ.
- Since most of the neural network learning methods are adaptive, the VQ training algorithms derived from these allow for the design of codebooks in adaptive vector quantization (AVQ).

Thus we built a new neural network with such a criterion. We used a code already existing on the laboratory computers whose purpose is LVQ-based classification.

[Figure: global aspect of the Learning Vector Quantization. 1. The original patterns A, B, C are divided into vectors X. 2. These vectors X are compared to a table. 3. The index of the best matching pattern in the codebook is used. The codebook can be constructed in advance or dynamically.]

Below are the various results:

Table 9: Rates of Good Classification with the LVQ-based classifier

global rate   45,11%
music rate    35,64%
speech rate   52,24%

We were not able to increase those rates, and we preferred to spend more time on the following classifier.
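The nearest-codebook mapping at the heart of VQ (steps 2 and 3 in the figure) can be sketched in a few lines; the codebook and incoming vectors below are illustrative, not a trained LVQ codebook:

```python
import numpy as np

def quantize(signal_vectors, codebook):
    """Map each incoming vector to the index of its nearest codebook vector."""
    # Pairwise distances, shape (n_vectors, n_codes), via broadcasting.
    d = np.linalg.norm(signal_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
vectors = np.array([[0.2, -0.1], [0.9, 1.2], [4.8, 5.1], [1.1, 0.8]])
indices = quantize(vectors, codebook)
```

LVQ training then moves the codebook vectors toward (or away from) training samples depending on whether the nearest code has the correct class label; only this quantization step is shown here.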
Decision Tree algorithm

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some feature value of the instance, and each branch descending from that node corresponds to one of the possible values of this feature. An instance is classified by starting at the root node of the tree, testing the feature specified by this node, then moving down the tree branch corresponding to the value of the feature in the given example. This process is then repeated for the subtree rooted at the new node.

[Figure: usual aspect of a Decision Tree. The root node considers all features on the training set S; each descendant node tests one feature on a training subset, down to leaves labelled Class=1 or Class=2. Each subset is composed of the samples which have the same value for the maximum-gain feature.]

Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. In other words, it searches at each level or node for the most convenient feature to be the best classifier. Then this feature is used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this feature, and the training examples are sorted to the appropriate descendant node. The entire process is then repeated using the training examples
associated with each descendant node to select the best feature to test at that point in the tree. We thus need a criterion to select the appropriate feature: the Entropy. This statistical property measures the information gain, or how well a given feature separates the training examples according to their target classification. Here the Entropy of a collection S is:

$$\mathrm{Entropy}(S) \equiv -\,p_{music}\log p_{music} - p_{speech}\log p_{speech}$$

where p_music is the proportion of music samples and p_speech is the proportion of speech samples in S. In all calculations involving Entropy we define 0·log 0 to be 0. We give Entropy as a measure of the impurity of a set of training samples relative to a feature. We thus calculate the Information Gain of a feature A relative to the whole set S of training signals in this way:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \,\in\, Values(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S_v is the subset of S for which feature A has value v, and Values(A) is the set of all possible values for feature A. We determine this Information Gain for each feature expected in the subtree at the considered level and select the one with the higher value.

To sum up the process: we build a Decision Tree and then classify an "unknown" signal by conducting it through the Tree, looking at its features' values. To build the Tree, we have to determine the Information Gain for each feature at each level, as explained previously. So the values of the features for the whole stack of samples have to be discretized, so that a signal belongs to one of the subclasses of the investigated features. First we attempted to distribute the various values of each feature between two subclasses, high values and low values, considering that all that is above the mean of the values of the feature is a high value and all that is below is a low value. A new training matrix was thus completed, composed only of 1s and 0s (1 for a high value and 0 for a low value).
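The two formulas above can be sketched directly (our illustration, using base-2 logarithms and a toy eight-sample set with the 1/0 above-the-mean discretization described in the text):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -p_music log p_music - p_speech log p_speech (0 log 0 := 0)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))  # np.unique drops empty classes, so no 0 log 0

def information_gain(labels, feature_values):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = np.array(["music"] * 4 + ["speech"] * 4)
# Discretized feature values (1 = above the mean, 0 = below), as in the thesis.
perfect = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # splits the classes exactly
useless = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # carries no class information
```

A feature that splits the classes exactly achieves the full one bit of gain, while an uninformative feature scores zero; this is how low-relevancy features end up "forgotten" by the tree.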
The results are not as relevant as we wished, without being unacceptable:

Table 10: Good Classification with the Decision Tree method

Global Rate of Good Classification                85,11%
Rate of Good Classification of Music Signals      85,82%
Rate of Good Classification of Speech Signals     84,16%

We thought that our way of discretizing the features' values was too coarse. We then opted to still discretize the features' values into two subclasses, but with the help of the Gaussian classifier as already used previously. Following are the results:

Table 11: Good Classification with Discretization of the Features by the Gaussian method

Global Rate of Good Classification                71,91%
Rate of Good Classification of Music Signals      68,32%
Rate of Good Classification of Speech Signals     74,63%

It does not seem to perform better either, but we detected another shortcoming: we study only one feature, one by one, at each level of the Tree. It may be more relevant to examine a combination of several features to better specify the discrimination at each level. So we worked on building an adequate training matrix with the help of the Gaussian classifier and combined features. This work is still in progress, and we did not have the results when this report went to reproduction. But it seems that it is a computationally expensive way of working.
Discussion

Because this is "free" research (this job has not been done for a precise commercial purpose), there cannot be answers such as "this classifier is the best we can use". The Gaussian classifier and k-Nearest-Neighbors are the quickest algorithms and actually give the best rates of good classification. The Neural Networks method must be studied by experts: this way of classification is a very complex one, and it may be necessary to make this study an entire research topic. But we are not to forget the Decision Tree method: it is true that our results are less performant than the Gaussian and kNN ones, added to the fact that it is computationally more expensive. Anyway, our work was just a preliminary to a more structured and specific possible project. Some possible evolutions have already been given and can be completed as future work. Moreover, this kind of classification is worth concentrating on because of its specificity: features are selected by calculating their information gain, thus features with low relevancy are almost "forgotten". It classifies not only the signals but also the features themselves, which has not been done by the other classifiers.
CONCLUSION

Our duty was to present a wide-open panel of speech/music discriminators, and so we did. We conducted a large study on feature extraction in a first step, and then went through several kinds of classifiers.

Reaching a rate of more than 95% of good classification reveals that good discriminators can be implemented further. The Spectral Centroid and this particular feature, the Cepstrum, seem to be very relevant features (the kNN classifier gives independent rates of good classification as high as 85,53% and 81,28%, respectively). The Spectral Flux is not to be forgotten, with 71,06% of good classification with the Gaussian classifier and 79,57% with the kNN classifier. Classifiers are not to be easily improved: the Gaussian and kNN classifiers give the highest rates and are very computationally cheap. Neural Networks are to be given up because of too much complexity. But the Decision Tree classifier is capable of better results than given in this report. As we already said, it is worth being studied, because this classifier also improves the "purity" of the features with the information gain calculation. Here is the future work to be done.

Actually, it appears that much work is still to be done. Surely, classifiers themselves have to be perfected, but this is the usual way of proceeding, and many independent and parallel works are conducted at this time. Actually, the main part of the future work will concern the feature extraction as well: the 4 Hz modulation characterizes well the patterns of speech, or at least of human beings' voices, but it is still not a categorical enough feature to allow the exact classification of the signals. We have to increase the volume of such attributes. We have to ensure that these features are easy to extract and fast to calculate, looking at the main part of the possible commercial applications. Speech processing is a main topic of interest in the audio processing field, and the rise of telecommunication currently allows money to be distributed and projects to be conducted. It may be possible that new features will be implemented as the cepstrum was in its time, but let us not forget that the more relevant features are those on which many precise and deep studies were conducted.
GLOSSARY

ARG. Audio Research Group
AVQ. Adaptive Vector Quantization
DAMA. Content on Audio and Music Analysis
DAT. Digital Audio Tape
DMI. Digital Media Institute
DSP. Digital Signal Processing
DT. Decision Tree
EER. Equal Error Rate
HMM. Hidden Markov Model
ICASSP. International Conference on Acoustics, Speech and Signal Processing
IEEE. Institute of Electrical and Electronics Engineers
kNN. k-Nearest-Neighbors
LVQ. Learning Vector Quantization
NN. Neural Networks
RMS. Root Mean Square
SPL. Signal Processing Laboratory
TUT. Tampere University of Technology
VQ. Vector Quantization
ZCR. Zero-Crossing Rate
Appendices