
A two-stage speech activity detection system considering fractal aspects of prosody

Soheil Shafiee a, Farshad Almasganj a, Bahram Vazirnezhad a,b,*, Ayyoob Jafari a

a Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran
b Iran Electronics Research Institute, Iran

* Corresponding author. Address: Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran. Fax: +98 21 6649 5655.

E-mail addresses: [email protected] (S. Shafiee), [email protected] (F. Almasganj), [email protected] (B. Vazirnezhad), [email protected] (A. Jafari).

Pattern Recognition Letters 31 (2010) 936–948

Article info

Article history:
Received 11 August 2008
Received in revised form 6 November 2009
Available online 4 January 2010

Communicated by O. Siohan

Keywords:
Speech activity detection
Prosody
Fractal dimension

Abstract

Speech Activity Detectors (SADs) are essential in noisy environments to provide acceptable performance in speech applications, such as speech recognition tasks. In this paper, a two-stage speech activity detection system is presented which first takes advantage of a voice activity detector to discard pause segments from the audio signals; this is done even in the presence of stationary background noises. In the second stage, the remaining segments are classified into speech or non-speech. To find the best feature set for speech/non-speech classification, a large set of robust features is introduced; the optimal subset of these features is chosen by applying a Genetic Algorithm (GA) to the initial feature set. It has been discovered that the fractal dimensions of numeric series of prosodic features are the most differentiating speech/non-speech features. The models of the system are trained over a Farsi database, FARSDAT; however, test experiments on the TIMIT English database have also been conducted. Employing the SAD system in conjunction with an ASR system resulted in a relative Word Error Rate (WER) reduction of as high as 28.3%.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Automatic speech activity detection has many applications, especially in noisy environments. For example, in a speech recognition task, the efficiency of automatic detection of speech boundaries considerably influences the accuracy of the recognition engine; even a minor improvement in speech boundary detection improves the overall system performance in the long run. The most common methods for speech endpoint detection are based on short-time spectral energy (Savoji, 1989; Mauuary, 1994; Martin et al., 2000; Ramirez et al., 2005). In (Ramirez et al., 2004), the information of long-term spectral energy in sub-bands is the main idea of speech detection. These approaches work well over clean signals, but their performance is not acceptable in the presence of various kinds of noise. Typically, an adaptive threshold based on features of the energy profile is employed to differentiate speech segments from silence/background sounds (Martin and Mauuary, 2006; Prasad et al., 2002). In (Martin and Mauuary, 2006), a dynamic approach is employed to detect speech and non-speech segments. A finite state machine (FSM) with five states decides whether an input frame is speech or not, in a manner where the state of a frame depends on the state of its previous frame. Energy and some duration constraints are scrutinized in order to make a decision for every input frame. In this approach, some non-speech noises characterized by high energy over a short duration can be rejected, which is an advantage. Moreover, short pauses within natural speech will not be classified as non-speech. In order to increase the algorithm's robustness in the presence of slowly varying background noise, in each step the energy threshold is updated based on the statistics of the last non-speech frames. The lack of complex features differentiating speech from non-speech sounds and noise is the main drawback of the approach introduced in (Martin and Mauuary, 2006). Using "Teager" energy is another useful technique in speech/non-speech separation. This operator depends both on the signal energy and on the fundamental frequency of the signal (Ying et al., 1993; Chen et al., 2005). Teager energy is robust in noisy environments; however, its main drawback is its weakness in dealing with signals mixed with certain types of non-speech sounds.

Entropy, as a measure of order and similarity, is used in many SAD systems. This parameter, first used by Shannon in information theory, shows the degree of organization or the uncertainty level of the signal (Wu and Wang, 2005). Entropy can be used in both the time domain (Waheed et al., 2002) and the frequency domain (called spectral entropy) (Shen et al., 1998; Basu, 2003). Entropy is a powerful feature for speech and noise discrimination; however, its main drawback is the inability to discard certain types of non-speech sounds produced by speakers, such as cough and breath.

MFCCs (Mel Frequency Cepstral Coefficients) and their derivatives are often used in speech processing applications. The main idea of calculating these coefficients comes from the human hearing mechanism. In (Martin and Mauuary, 2006), after Linear Discriminant Analysis (LDA) of the MFCC parameters, they are used in a FSM to determine whether a frame is speech or not. The problem with using just MFCCs is that they are not completely capable of differentiating between speech and non-speech signals generated by the speaker, such as cough and breath, as experiments in this paper confirm. In (Padrell et al., 2005), MFCCs are used besides other features like short-time energy, sub-band energy and zero crossing rate (ZCR) as discriminative features, and finally a Support Vector Machine (SVM) is employed to classify speech and non-speech signals. Cepstral variability is another approach, used in (Skorik and Berthommier, 2000). These complementary features significantly improve the classification task, as they carry important information on the entropy and order of the signal.

The spectral distributions of speech and non-speech signals are not alike; the wavelet transform is a suitable technique to expose the difference, i.e. the sub-band energies can be compared in the wavelet domain. This approach is mainly used in Voice Activity Detection (VAD) applications (Shaojun et al., 2004; Chen et al., 2005). Its efficiency in SAD systems has not been comprehensively studied.

Some studies have employed different forms of garbage models to detect and reject non-speech parts (Villarrubia and Acero, 1993). In these approaches, a large amount of training data is required to obtain good estimates of the parameters of the non-speech models.

In this work, we have considered features with high discrimination power and robustness against various kinds of background non-speech sounds and noise. The fractal dimension of short-time prosodic features is proposed in this paper. Prosodic patterns differ extensively across languages; however, all languages obey some inherently fixed rules and specifications because of the anatomical similarities of human vocal tracts (Hori and Furui, 2003; Koumpis and Renals, 2001). Autocorrelation is a well-known technique to compute the pitch frequency. Pitch frequency is mainly limited to a specific range and does not change abruptly along the signal. In (Tian et al., 2002), the smoothness of the variations of the autocorrelation function is used as the main feature for speech/non-speech discrimination. In this work, two criteria are used to estimate the smoothness of the autocorrelation function, called the Mean Crossing Ratio and the Energy Distribution Coefficient.

Different classifiers have been used to classify audio signals into speech/non-speech. Neural networks, as a supervised technique, have been widely used in classification problems. In (Hussain et al., 2000), two kinds of neural networks have been used for this purpose: an ADAptive LINEar network (ADALINE) with the Widrow–Hoff training rule, and a two-layer Perceptron network. The disadvantage of neural networks is the costly training procedure. Cepstral Linear Predictive Coding (LPC) is the main feature used to train the neural networks. In (Beaufays et al., 2003), features are extracted from signal frames and used to train a three-layer feed-forward network with 400 neurons in the hidden layer to detect speech signals. To evaluate this speech detector, a speech recognition system is used with and without this stage and the WER is calculated for each condition; using the speech detector, the error rate shows a 12% reduction.

The decision tree is another supervised classification technique used in SAD systems. In (Padrell et al., 2005), a 31-dimensional feature vector is used as the discriminative feature vector and a decision tree is used as the speech/non-speech classifier.

SVM is a powerful method for classifying an input space into two classes, and its convergence speed in the training phase is faster than that of other classification techniques. However, in comparison with neural networks, SVM is not efficient in multi-class classification tasks. SVM has been extensively used in the speech/non-speech classification problem (Abdulla et al., 2003; Enqing et al., 2002). In (Xianbo and Guangshu, 2005), an adaptive SVM algorithm has been used, in which the support vectors of the SVM are updated in each step. The feature vector in that work consists of MFCCs, short-time energy, sub-band energy and ZCR.

In this paper, which is an extended report of our previous work presented in (Shafiee et al., 2008), a two-stage approach is proposed for the SAD system. First, a segmentation block detects high-energy segments in the signal, which can potentially be speech. These segments are subjected to further analysis in the following speech/non-speech classifier. It should be clarified here that in a large part of the literature the term VAD is used in place of SAD; however, in this paper SAD means a speech detector, while VAD is simply a classification system to detect any sound, including both speech and non-speech segments such as cough, laugh, breath, etc. The conventional MFCC-based features are the main features used to segment auditory signals into speech/non-speech parts. Moreover, prosodic features have been employed directly and indirectly, as duration, energy and fractal calculations over the autocorrelation numeric series, to increase the performance of the classifier. Since there may be a large amount of redundancy in some of these features, a GA is applied to find an optimal subset of the feature vector to be employed in the classification task. Finally, an SVM classifier is exploited to discriminate speech from non-speech segments.

The remaining parts of the paper are organized as follows. Section 2 describes the framework of the proposed system. In Section 3, databases and evaluation methods are introduced. Section 4 details the first stage of the system. In Section 5, the features used in speech/non-speech classification are discussed; feature selection using GA is also presented in this section. In Section 6, the SVM classifier is discussed. Experimental results are presented in Section 7. Sections 8 and 9 are the discussion and conclusion, respectively.

2. The overall framework of the system

The proposed SAD system consists of two basic stages, as shown in Fig. 1. The first stage is a "segmentation block" which keeps potentially-speech parts and rejects parts that are certainly non-speech, considering energy and duration aspects simultaneously. The segmentation block utilizes a FSM which will be explained in detail later. The main idea of the FSM is from (Martin and Mauuary, 2006) and relies on the fact that the speech parts of an audio signal occur in segments with certain energy and duration. The second stage is a "speech/non-speech discriminator" which tags the remaining segments as speech or non-speech. For this purpose, a large set of features is computed, and finally an SVM classifier is used to decide whether a segment is speech or not. The features are divided into two main groups: MFCC-based features, and prosodic and autocorrelation-based features. The features are discussed in detail in Section 5.

The two-stage SAD system has excellent performance in the presence of both stationary and non-stationary noises. In the first stage, the segmentation block is applied to the audio signal, and consequently, in the next stage, speech/non-speech classification is made for each segment. In this manner, the system is able to reject transient noise segments that carry high energy levels.

Fig. 1. The block diagram of the proposed SAD system.


The SAD system is not designed to eliminate background noises overlapping the speech segments. In such a situation, the solution is to apply a post-processing unit just after the SAD system to enhance the speech segments before they are fed to an ASR engine.

To detail the approach: after the databases are introduced in Section 3, the segmentation block and the second stage of the system are described in detail in Sections 4 and 5, respectively.

3. Materials and methods, corpus and empirical workbench

3.1. Speech databases

We have used different databases to train and test the proposed SAD system. The first speech database used is FARSDAT (Bijankhan and Sheikhzadegan, 1994). FARSDAT consists of 6000 sentences uttered by 304 Farsi speakers. Each speaker uttered 20 sentences in the acoustic booth of the Linguistics Laboratory of the University of Tehran, and the utterances are phonetically transcribed. A SONY cardioid dynamic microphone with a frequency response within the 80 Hz–16 kHz range was used, with a distance of about 12 cm between the microphone and the speaker's lips. The speech was collected using sound blaster cards installed in four 80486 IBM microcomputers and sampled at 22,050 samples per second at 16-bit resolution. The signal-to-noise ratio of FARSDAT is about 31 dB.

In this work, these speech signals are fed into the segmentation block introduced in Section 4, and 8579 segments detected as speech are collected. These speech segments are used to train and evaluate different parts of the proposed SAD. To test the performance of the system on a language other than Farsi, and to evaluate its language independency, the well-known English speech corpus TIMIT has been used. Moreover, the performance of an ASR system has been examined with and without the SAD system at the front end of the speech recognition task. The ASR system was trained on the TIMIT train set and is evaluated on its test set from 24 speakers, distorted by non-speech segments manually added in the pause intervals between speech segments.

3.2. Non-speech databases

FARSDAT does not contain sufficient non-speech segments. Therefore, non-speech signals were collected or recorded in the following three ways. In total, 10,892 non-speech segments have been used in this study.

3.2.1. Non-speech parts of the T-FARSDAT database
T-FARSDAT is a Farsi corpus consisting of phonetically hand-labeled telephone conversations (Bijankhan et al., 2003). This database consists of conversation files from 64 Farsi speakers recorded at a sampling rate of 11,025 Hz. In this database, non-speech parts of conversations, such as breath and laugh, are labeled. These parts have been used as a source for non-speech segment collection.

3.2.2. Non-speech signals from available internet sources
Some internet sources, such as FREESOUND, include non-speech sounds recorded in different formats and sampling rates. The sampling rate of all of these samples has been converted to 22,050 Hz for uniformity in this work.

3.2.3. Recorded non-speech signals
In order to increase the variety of non-speech samples, some frequent non-speech sounds have been recorded especially for this work. These samples have been produced by 10 speakers and include cough, laugh, sneeze, crying, hiccup, scream, breath, etc.

3.3. Noise database

In order to evaluate the overall system in the presence of background noises, four different additive background noises, including white, pink, babble and factory noise from NOISEX-92, have been examined. The sampling frequency of these signals has been changed from 19,980 Hz to 22,050 Hz in order to be consistent with the other speech and non-speech samples.

3.4. Evaluation parameters

The proposed system, as shown in Fig. 1, consists of two main stages. In this paper, these two stages are evaluated separately, and the overall performance is also evaluated by conducting a speech recognition experiment with and without the SAD system.

The first stage of the SAD is a segmentation block that detects audio segments. This block of the system is evaluated using a signal consisting of 20 utterances whose audio segments have been manually labeled. Two parameters, tHR0 and tHR1, are used to evaluate the segmentation block.

$$\text{tHR0} = \frac{\text{total duration of correctly detected silence}}{\text{total duration of silence in the signal}} \qquad (1)$$

$$\text{tHR1} = \frac{\text{total duration of correctly detected audio segments}}{\text{total duration of audio segments in the signal}} \qquad (2)$$

The second stage of the SAD system classifies audio segments into speech and non-speech. In order to evaluate this stage, two criteria have been proposed: the "non-speech segment rejection ratio" or "hit rate for non-speech", and the "speech segment false rejection ratio" or "hit rate for speech" (Górriz et al., 2005). These criteria are called HR0 (hit rate for non-speech) and HR1 (hit rate for speech), respectively, and are defined as follows:

$$\text{HR0} = \frac{\text{number of non-speech segments detected as non-speech}}{\text{total number of non-speech segments}} \qquad (3)$$

$$\text{HR1} = \frac{\text{number of speech segments detected as speech}}{\text{total number of speech segments}} \qquad (4)$$

In addition, to evaluate the ASR system performance with and without the SAD system, the Word Recognition Rate (WRR) has been measured for the ASR system.

$$\text{WRR} = \frac{TW - SW - DW - IW}{TW} \qquad (5)$$

where TW is the total number of words, SW stands for the number of substituted words, DW is the number of deleted words, and IW is the number of false insertions.
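To make these measures concrete, here is a minimal Python sketch (ours, not part of the paper) that evaluates Eqs. (1)–(5) from raw durations and counts; the function and variable names are illustrative only.

```python
def hit_rate(correct: float, total: float) -> float:
    """Generic hit rate: used for tHR0/tHR1 (durations, Eqs. (1)-(2))
    and HR0/HR1 (segment counts, Eqs. (3)-(4))."""
    return correct / total if total > 0 else 0.0

def word_recognition_rate(tw: int, sw: int, dw: int, iw: int) -> float:
    """WRR = (TW - SW - DW - IW) / TW, Eq. (5)."""
    return (tw - sw - dw - iw) / tw

# Example: 1000 reference words; 50 substitutions, 30 deletions, 20 insertions.
print(word_recognition_rate(1000, 50, 30, 20))  # 0.90, i.e. WER = 10%
```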

4. Segmentation module, the first stage of the proposed SAD

The proposed SAD system, in its first stage, detects those parts of the signal which are likely to be speech. These are often high-energy segments of the signal with a duration plausible for speech. Errors in this stage may cause false rejection of speech segments, which degrades the overall performance of the SAD system. The FSM is a dynamic technique used to extract likely-to-be-speech segments from the input audio signal, similar to the approach in (Martin and Mauuary, 2006). The FSM proposed in this work applies duration constraints besides energy constraints in order to reject non-speech segments, especially those with high energy levels and short durations. The inputs of the FSM are frames of the audio signal. The energy level across the signal is compared with an adaptive threshold. The energy threshold is updated periodically based on statistics derived from the last non-speech frames. The adaptive threshold increases the robustness of the system to semi-stationary background noises with variable amplitude. Fig. 2 shows the proposed FSM.

The FSM shown in Fig. 2 has five states: (1) low energy, (2) high-energy presumption, (3) high energy, (4) brief energy increase and (5) high-energy continuation. The transition regime checks energy and duration conditions to determine the present state. Constraints and actions are shown in Fig. 2 with the letters C and A, respectively; in each transition, some actions take place. The constraints and actions are defined in Table 1.

The energy threshold update regime is based on the statistics of the last few non-speech frames, similar to (Prasad et al., 2002). For every frame of the signal, the updated threshold value is calculated by the following equation.

$$T_{\text{new}} = (1 - p)\,T_{\text{old}} + p\,T_{\text{silence}} \qquad (6)$$

where $T_{\text{old}}$ is the previous energy threshold, $T_{\text{silence}}$ is the mean energy of the last eight non-speech frames, and 0 < p < 1 is determined based on the variation rate of the energy of the non-speech frames. The frames here contain 512 samples at a sampling frequency of 22,050 Hz. The energies of the last eight non-speech frames are always stored in an array. The variation rate of the background noise is calculated as the ratio $d_{\text{new}}/d_{\text{old}}$, where $d_{\text{new}}$ and $d_{\text{old}}$ are noise energies calculated from the updated and previous energy arrays, respectively. Table 2 shows the p value determination scheme.
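The threshold update can be sketched as follows. This is our illustration, assuming (since the paper leaves it open) that the noise energies $d_{\text{new}}$ and $d_{\text{old}}$ are the means of the updated and previous eight-frame energy arrays; all names are ours.

```python
import numpy as np

def choose_p(d_new: float, d_old: float) -> float:
    """Map the background-variation ratio d_new/d_old to p (Table 2)."""
    r = d_new / d_old
    if r >= 1.25:
        return 0.25
    if r >= 1.1:
        return 0.20
    if r >= 1.0:
        return 0.15
    return 0.10

def update_threshold(t_old: float, prev_energies, curr_energies) -> float:
    """T_new = (1 - p) * T_old + p * T_silence, Eq. (6).

    prev_energies, curr_energies: previous and updated arrays holding the
    energies of the last eight non-speech frames (512-sample frames).
    """
    t_silence = float(np.mean(curr_energies))
    p = choose_p(float(np.mean(curr_energies)), float(np.mean(prev_energies)))
    return (1.0 - p) * t_old + p * t_silence
```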

In the proposed FSM, frame sequences longer than 70 ms with a sufficient energy level are labeled as likely to be speech, and low-energy segments with a duration of more than 116 ms are rejected. The algorithm checks the duration when a low-to-high energy transition is detected; if the energy of the input frames then remains above the threshold for at least 70 ms, the sequence is labeled as a likely-to-be-speech segment. Conversely, when the energy level remains below the threshold for at least 116 ms, the frames are rejected. The energy threshold is adaptively updated in each step. In this scheme, a short-duration high-energy noise or a brief silence in the middle of a speech segment does not switch the state. There are five different states for the frames of the input signal, which are then used to make a final decision on whether a frame sequence is likely to be speech or non-speech. The original five states and the final FSM labels are shown in Fig. 3 for a sample audio signal. In this figure, the five states of the FSM, from 1 to 5, are shown on the vertical axis at the values 0.2, 0.4, 0.6, 0.8 and 1, respectively. The final labels are shown with the dashed line, with the higher level for speech segments.
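The duration logic can be illustrated with the simplified sketch below. It collapses the five FSM states into the two final labels and keeps only the 70 ms / 116 ms hysteresis; the real FSM of Fig. 2 additionally applies the actions and constraints of Table 1.

```python
FRAME_MS = 512 / 22050 * 1000   # ~23.2 ms per 512-sample frame at 22,050 Hz
HIGH_MS, LOW_MS = 70, 116       # durations to enter / leave the speech label

def label_frames(energies, thresholds):
    """Label each frame 'speech' or 'non-speech' with duration hysteresis."""
    labels, state = [], "non-speech"
    hd = ld = 0.0               # accumulated high/low-energy durations (ms)
    for e, t in zip(energies, thresholds):
        if e > t:
            hd += FRAME_MS
            ld = 0.0
            if hd >= HIGH_MS:   # above threshold long enough: likely speech
                state = "speech"
        else:
            ld += FRAME_MS
            hd = 0.0
            if ld >= LOW_MS:    # below threshold long enough: reject
                state = "non-speech"
        labels.append(state)
    return labels
```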

After applying the FSM, the remaining segments are subject to further processing in the next stage. Sections 5 and 6 explain the features and the proposed SVM used for the speech/non-speech classification task in the second stage of the SAD system.

5. Feature extraction and selection

As mentioned earlier, the second stage of the proposed system classifies the remaining, likely-to-be-speech segments into speech/non-speech. For this purpose, a segment is divided into a number of overlapping frames of the same length. Here, a frame length of 256 samples (11.6 ms at a sampling frequency of 22,050 Hz), with an overlap of 128 samples (5.8 ms) between adjacent frames, is used. Following the framing process, a number of MFCC-based features are calculated separately for each frame. The average value of each feature is then calculated across frames to obtain a single averaged MFCC-based feature vector. Some extra features extracted from the same segment are added to the feature vector, and finally a binary SVM classifier decides whether or not the segment is speech. In the following subsections, the feature extraction and selection methods are explained.

5.1. Features

Fig. 2. Schematic of the segmentation block.

Table 1
Actions and constraints in the FSM state transition regime.

Actions:
A1: LD = LD + 1
A2: HD = 1
A3: LD = LD + HD
A4: HD = HD + 1
A5: LD = 1
A6: LD = HD = 0

Constraints:
C1: Frame Energy < Energy Threshold
C2: High Energy Duration (HD) > 70 ms
C3: Low Energy Duration (LD) > 116 ms

Table 2
p value determination, based on the d_new/d_old ratio.

d_new/d_old ≥ 1.25          | p = 0.25
1.25 ≥ d_new/d_old ≥ 1.1    | p = 0.20
1.1 ≥ d_new/d_old ≥ 1       | p = 0.15
1 ≥ d_new/d_old             | p = 0.10

Fig. 3. The original five states of the FSM (red solid line) and the corresponding final labels (blue dashed line).


One of the main goals of this work is to find the most discriminating features for speech/non-speech classification of audio segments. The most differentiating features out of a crude set of features have been discovered using a GA. The features investigated here can be divided into two main groups: (1) MFCC-based features and (2) prosodic features. In the following subsections, the features are discussed in more detail.

5.1.1. MFCC-based features
Mel Frequency Cepstral Coefficients (MFCCs) and their derivatives are known to be the most common features in speech processing applications (Padrell et al., 2005; Skorik and Berthommier, 2000). To compute these features from the signal, a filter bank is applied to the signal spectrum. The center frequencies of these filters are distributed on the Mel scale. The MFCCs are the cosine transform of the logarithm of the filter bank outputs. In this work, the first 13 MFCCs and their derivatives are exploited. As an audio segment consists of a number of frames, the mean values of the MFCCs can be calculated over the frames of a segment. As a result, a feature vector of 13 components, called mean-MFCC, is obtained. In the second step, the first derivatives of the MFCCs over consecutive frames in a segment are calculated by the following equation.

$$\Delta\text{MFCC}_i[n] = C_i[n+1] - C_i[n], \quad 1 \le n \le N-1, \; 1 \le i \le 13 \qquad (7)$$

where $C_i[n]$ is the ith MFCC of the nth frame in a segment, and N is the total number of frames in the segment. Moreover, a new feature vector consisting of the maximum values of $\Delta\text{MFCC}[n]$ is obtained for a segment as follows.

$$D_i = \max_{n}\left(\Delta\text{MFCC}_i[n]\right), \quad 1 \le n \le N-1, \; 1 \le i \le 13 \qquad (8)$$

where $D_i$ is the maximum value of the ith component of the $\Delta\text{MFCC}$ vector across the segment. This vector is called max-diff-MFCC. The discrimination ability of the various components of the mean-MFCC vector is shown in Fig. 4, which presents histograms over a large number of segments.
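A compact sketch of both MFCC-based vectors follows; librosa is assumed here for MFCC extraction (the paper does not name a toolkit), and Eq. (8) is taken literally, i.e. the maximum of the signed differences.

```python
import numpy as np
import librosa  # assumed toolkit; any 13-coefficient MFCC extractor would do

def mfcc_features(segment: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Return the [mean-MFCC, max-diff-MFCC] vector (26 values) of a segment,
    using 256-sample frames with 128-sample overlap as in Section 5."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=256, hop_length=128)  # shape (13, N)
    mean_mfcc = mfcc.mean(axis=1)        # mean over the frames of the segment
    dmfcc = np.diff(mfcc, axis=1)        # Eq. (7): C_i[n+1] - C_i[n]
    max_diff_mfcc = dmfcc.max(axis=1)    # Eq. (8): D_i = max_n(dMFCC_i[n])
    return np.concatenate([mean_mfcc, max_diff_mfcc])
```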

As shown in Fig. 4, the coefficients with indices 1, 3, 10 and 12 appear to be better at discriminating speech from non-speech. Fig. 5 shows the histograms of the max-diff-MFCC coefficients over speech and non-speech segments separately; here the coefficients with indices 1, 2, 12 and 13 seem to have the maximum discriminative power.

5.1.2. Autocorrelation-based features
The autocorrelation function is tightly related to the pitch trajectory in speech signals, and is one of the most well-known techniques for pitch frequency calculation. The pitch frequency of a speech signal varies within a limited range, so the autocorrelation function can be used to extract features that are useful for speech/non-speech discrimination. The autocorrelation function of frame l, with N samples of the sequence s, is given by the following equation.

$$\Phi_l(m) = \sum_{n=1}^{N-m} s_l(n)\, s_l(n+m) \qquad (9)$$

The maximum of this function always occurs at m = 0, so $\Phi_l$ can be normalized by dividing $\Phi_l(m)$ by $\Phi_l(0)$; the result is called the "Normalized AutoCorrelation Function" (NACF), similar to (Tian et al., 2002). NACFs are calculated for all of the frames taken from a segment. The index and magnitude of the first peak of the NACF (the peak at m = 0 is ignored) are called PI and PM, respectively. The frames of a segment thus yield series of PI and PM values, called PIS and PMS, respectively. The length of these series depends on the length of the segment, i.e. the number of frames in it. An example of PIS and PMS extracted from a speech and a non-speech segment is shown in Fig. 6.
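The following sketch (ours) computes the NACF and the PIS/PMS series; the "first peak" search is our interpretation (first local maximum after lag 0), and the frame parameters follow Section 5.

```python
import numpy as np

def nacf(frame: np.ndarray) -> np.ndarray:
    """Autocorrelation of Eq. (9), normalized by its value at lag 0."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return ac / ac[0]

def first_peak(n: np.ndarray):
    """Index (PI) and magnitude (PM) of the first local peak after lag 0."""
    for m in range(1, len(n) - 1):
        if n[m] > n[m - 1] and n[m] >= n[m + 1]:
            return m, n[m]
    return len(n) - 1, n[-1]        # fallback when no interior peak exists

def pis_pms(segment: np.ndarray, frame_len: int = 256, hop: int = 128):
    """PIS and PMS series over the overlapping frames of a segment."""
    pis, pms = [], []
    for start in range(0, len(segment) - frame_len + 1, hop):
        pi, pm = first_peak(nacf(segment[start:start + frame_len]))
        pis.append(pi)
        pms.append(pm)
    return np.array(pis), np.array(pms)
```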

As shown in Fig. 6, the PIS and PMS of a speech segment have less variation than those of a non-speech one. The smoothness of the PIS and PMS sequences extracted from audio segments is therefore an important measure for distinguishing speech from non-speech segments. Here, three criteria are introduced to measure the smoothness of PIS and PMS, as described in the following subsections.

5.1.2.1. Mean crossing ratio of PIS and PMS. The Mean Crossing Ratio (MCR) of the PIS and PMS sequences is employed to discriminate speech/non-speech segments. These features are defined as the number of times the PIS and PMS series cross their mean values. This can be formalized as follows, in which x(l) is PIS or PMS.

Fig. 4. Distributions of various components of mean-MFCC vector for speech and non-speech segments.


$$\text{MCR} = \frac{1}{L} \sum_{l=2}^{L} \Big| \operatorname{sgn}\big[x(l) - M\big] - \operatorname{sgn}\big[x(l-1) - M\big] \Big| \qquad (10)$$

where x is PIS or PMS, l indexes the components of x, $M = \frac{1}{L}\sum_{l=1}^{L} x(l)$, and L is the length of PIS or PMS (i.e. the total number of frames in the audio segment). These features are called PIS–MCR and PMS–MCR.
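Eq. (10) translates directly into a few lines of numpy; this sketch is ours.

```python
import numpy as np

def mean_crossing_ratio(x: np.ndarray) -> float:
    """MCR of Eq. (10): how often a PIS/PMS series crosses its own mean."""
    s = np.sign(x - x.mean())            # sgn[x(l) - M]
    return float(np.abs(np.diff(s)).sum() / len(x))
```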

Histograms of these two features are shown in Fig. 7(a) and (b). The figures show that they are only able to partially separate speech and non-speech segments, so more robust features are needed. Although the MCR distributions for speech and non-speech are fairly similar in Fig. 7, some kinds of non-speech, like breath, have a very high MCR which makes them distinguishable from speech; Fig. 6 shows such a case for a typical breath and a speech signal.

Fig. 5. Distributions of the max-diff-MFCC coefficients for speech and non-speech segments.

Fig. 6. (a) An audio signal, (b) extracted PIS and (c) extracted PMS.

5.1.2.2. Energy distribution of PIS and PMS. The Energy Distribution Coefficients (EDC) of the PIS and PMS sequences are two other features applied in this work. EDC is inspired by the observation that non-smooth signals have stronger high-frequency components than smooth ones; for non-smooth signals, the distribution of frequency components is mainly located at higher frequencies. Correspondingly, the energy distributions of PIS and PMS extracted from speech segments are mostly expected to be located at lower frequencies, and vice versa for non-speech segments. The EDC parameter for an audio segment is calculated as follows.

$$\text{EDC} = \frac{\sum_{l=1}^{L} l\, x_l}{L \sum_{l=1}^{L} x_l} \qquad (11)$$

where $x_l$ is the spectral amplitude of the lth sample of the DFT of the PIS or PMS sequence with L samples. By applying Eq. (11) to the PIS and PMS sequences, two parameters called PIS–EDC and PMS–EDC are obtained, which are used as two further features in this work. These two features have been calculated for the speech/non-speech segments; the histograms are shown in Fig. 8(a) and (b).
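A numpy sketch of Eq. (11) follows; we use the one-sided DFT of the real-valued series, an assumption, since the paper does not specify which half of the symmetric spectrum is used.

```python
import numpy as np

def energy_distribution_coefficient(x: np.ndarray) -> float:
    """EDC of Eq. (11), computed on the DFT magnitudes of a PIS/PMS series."""
    spec = np.abs(np.fft.rfft(x))        # spectral amplitudes x_l
    l = np.arange(1, len(spec) + 1)      # 1-based bin index
    return float((l * spec).sum() / (len(spec) * spec.sum()))
```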

5.1.2.3. Fractal dimension of PIS and PMS. Recently, many researchers have treated speech as a chaotic, complex and unpredictable phenomenon, and many nonlinear processes have been successfully exploited in speech processing (Kokkinos and Maragos, 2005; Banbrook and McLaughlin, 1994). Fractal dimension is one of the conventional methods used to describe the geometry and dimension of the state-space attractors of chaotic phenomena. Of course, in this paper we have no intention of discussing the chaotic nature of speech; we simply use the fractal dimension of prosodic numeric series as a measure of irregularity. Irregular signals are expected to have a greater fractal dimension than regular ones, and as mentioned previously, the PIS and PMS extracted from non-speech segments are usually more irregular than those extracted from speech segments. Petrosian's method, a well-known way to calculate the fractal dimension, is used in this work (Esteller et al., 2001; Hilborn, 2001). In this method, the fractal dimension is calculated as follows.

$$D = \frac{\log_{10} N}{\log_{10} N + \log_{10}\left(\dfrac{N}{N + 0.4\,N_{\Delta}}\right)} \qquad (12)$$

where N is the length of the sequence and $N_{\Delta}$ is the number of times the direction of variation between consecutive samples changes along the sequence. To calculate $N_{\Delta}$, consecutive samples are subtracted and a binary sequence of +1 and −1 is created according to whether each difference is positive or negative. The sequence here can be the PIS or PMS described in Section 5.1.2. Applying this function to PIS and PMS, two features are extracted for each audio segment. We have computed these two features for the speech/non-speech segments; the histograms are shown in Fig. 9(a) and (b). From the figure, the proposed features appear to be powerful at separating speech from non-speech.
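Petrosian's fractal dimension is only a few lines; the sketch below (ours) follows Eq. (12) and the $N_{\Delta}$ construction just described.

```python
import numpy as np

def petrosian_fd(x: np.ndarray) -> float:
    """Petrosian fractal dimension of a numeric series, Eq. (12)."""
    n = len(x)
    signs = np.sign(np.diff(x))                      # +1 / -1 binary sequence
    n_delta = int(np.sum(signs[1:] != signs[:-1]))   # direction changes
    return float(np.log10(n) /
                 (np.log10(n) + np.log10(n / (n + 0.4 * n_delta))))

# PIS-Petrosian and PMS-Petrosian are petrosian_fd(pis) and petrosian_fd(pms).
```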

Fractal features are fundamentally based on the self-similarity of a signal. For a speech segment, especially a vowel, this self-similarity is expected to be more pronounced than for a non-speech one. The small overlap between the histograms in Fig. 9, together with the feature selection study discussed in Section 5.2, verifies the high capability of these two newly introduced features in discriminating speech from non-speech.

Fig. 7. (a) PIS–MCR distributions and (b) PMS–MCR distributions.

Fig. 8. (a) PIS–EDC distributions and (b) PMS–EDC distributions.

5.1.3. Duration
The duration of a segment is another feature considered in this work. Some non-speech events, such as breath, cough and sneeze, generally have short durations; for such non-speech segments, duration is a constructive feature within the discrimination feature set. Histograms of this feature for speech and non-speech segments are compared in Fig. 10. The considerable overlap between the two distributions shows the limitation of this feature in speech/non-speech classification.

5.1.4. Energy gradient
Energy variation within a segment is another prosodic feature used in this study. To calculate this feature, the audio segment is first broken into sub-frames, the energy of each frame is calculated separately, and the energy gradient is given by

$$\Delta E[n] = \big| E[n+1] - E[n] \big|, \quad 1 \le n \le N-1 \qquad (13)$$

where N is the total number of frames in the segment. The mean of $\Delta E[n]$ is computed to obtain a single value of this parameter for an audio segment. The normalized histograms of this feature for speech and non-speech segments, shown in Fig. 11, have a widely spread overlap region.

5.2. Feature selection

The 34 features introduced in Section 5.1 are candidates to be fed to a classifier for classifying audio segments. Many of these features may carry overlapping information and could be pruned with minimal loss of information. It should be noted that the most common classifiers suffer from various limitations: increasing the feature vector dimension generally not only increases the complexity of the classification process, but may also decrease classification performance. To overcome this problem, a reduced version of the original feature set can be used. For instance, in the speech/non-speech classification of audio segments, PIS-Petrosian and PMS-Petrosian are the best discriminators (as discussed in Section 5.2.1), but they carry considerable repeated information; it is therefore sensible to employ only one of them in the classification process.

To follow this approach, we applied the F-Ratio measure (Cohen, 1986). By running a GA that uses the F-Ratio as its fitness function, we move toward an optimally pruned version of the original crude feature set.

5.2.1. F-Ratio
Fisher discriminant analysis for a two-class problem leads to Eq. (14), known as Fisher's measure or the F-Ratio (Cohen, 1986):

$$F = \frac{B}{W} \qquad (14)$$

where B, the "between-class scatter matrix", is given by

$$B = (\hat{\mu}_1 - \hat{\mu}_2)(\hat{\mu}_1 - \hat{\mu}_2)^{T} \qquad (15)$$

where $\hat{\mu}_i$ is the mean of the $N_i$ samples of class $\omega_i$ in the n-dimensional space. The matrix W is the sum of the scatter matrices of the two classes and is given by

$$W = W_1 + W_2 \qquad (16)$$

$$W_i = \sum_{b \in \omega_i} (b - \hat{\mu}_i)(b - \hat{\mu}_i)^{T}, \quad i = 1, 2 \qquad (17)$$

It is expected that a feature vector with a higher F-Ratio has more discrimination power, making the F-Ratio a good measure for comparing the extracted features on the task at hand.

Fig. 10. Distribution of duration over the speech and non-speech segments.

Fig. 11. Distribution of energy gradient over the speech and non-speech segments.

Fig. 9. (a) PIS-Petrosian distributions and (b) PMS-Petrosian distributions.


As a first experiment, the F-Ratio has been computed individually for each of the introduced features over the two-class dataset consisting of the speech/non-speech segments. Fig. 12 shows the F-Ratio values calculated for each of the 34 features, in descending order from left to right.

The same values are listed in Table 3. As can be seen in the table, the fractal-dimension-based features have the highest F-Ratio values; by the F-Ratio criterion, the best discriminating feature is PMS-Petrosian.
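For a single feature, B and W in Eqs. (14)–(17) reduce to scalars, so per-feature F-Ratios like those of Fig. 12 and Table 3 can be computed with a sketch like the following (ours):

```python
import numpy as np

def f_ratio_per_feature(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Scalar F-Ratio of Eqs. (14)-(17), computed feature by feature.

    x1, x2: (n_samples, n_features) arrays for the speech and the
    non-speech class, respectively.
    """
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    b = (mu1 - mu2) ** 2                               # between-class scatter
    w = ((x1 - mu1) ** 2).sum(axis=0) + ((x2 - mu2) ** 2).sum(axis=0)
    return b / w                                       # higher = more separable
```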

5.2.2. Feature selection using GA
In the previous section, the 34 distinct features extracted from audio segments were compared to each other using the F-Ratio. Employing all of these features in the classification process will not necessarily lead to the best results. On the other hand, some features with high F-Ratio values carry repeated information and are redundant. For example, from Table 3 and Fig. 12, the best discriminating features are the two fractal features, PMS-Petrosian and PIS-Petrosian; but these features carry almost the same information. Such problems can significantly reduce the performance of most classifiers. To avoid them, a GA, as an adaptive feature selection procedure, can be employed to choose the best feature set. The high complexity and low speed of the GA are not a problem here, since the feature selection procedure needs to be done only once, and the optimum feature set found remains fixed for use in the subsequent classification process.

It has been shown that the GA is an efficient approach for large-scale optimal subset selection problems (Xu and Chan, 2002; Selouani and O'Shaughnessy, 2004). The search algorithm in a GA does not follow the steepest path of the error gradient to find the minimum of the error function; on the contrary, it examines various points in the search space with a random approach guided by the laws of evolution. It is therefore particularly appealing when gradient information is unavailable or costly to obtain. Unlike gradient-based training algorithms, a GA can better handle global search over a vast, complex, multi-modal and non-differentiable surface. Moreover, a GA is generally much less sensitive to the initial conditions of the training data: it always searches globally for an optimal solution, while a gradient descent algorithm can only find a local optimum in a neighborhood of the initial condition.

The GA is inspired by the laws of natural evolution and genetics. Its principal idea is to search for the optimal solution within a large population. It usually uses a fixed-length continuous or discrete string, called a chromosome, to represent a possible solution. A simple GA consists of three main operations: parent selection, crossover and mutation. The population comprises a group of chromosomes from which candidates can be selected as possible solutions of a given problem. The initial population often consists of a group of individuals whose chromosomes are randomly generated. During the optimization task, the fitness of the individuals in the current generation is evaluated using a user-defined fitness function.

In this work, each individual's chromosome is a binary string representing a set of active or inactive genes. The indices of the active genes represent the components of the feature vector that participate in the classification task.

To optimize the speech/non-speech classification task, the optimum number of features as well as their combination should be determined. In this work, parent selection in the GA is based on a stochastic uniform selection algorithm. The crossover function employs a so-called scattered scheme to generate children: it creates a random binary vector and selects the genes where the vector is 1 from the first parent and the genes where the vector is 0 from the second parent, combining them to form the child. The mutation function adds a random number drawn from a zero-mean Gaussian distribution to each entry of the parent vector. In generation replacement, two individuals are guaranteed to survive into the next generation. The fitness function used in the GA is the F-Ratio discussed in Section 5.2.1: the GA searches for the feature vector with the highest F-Ratio. To find the best feature vector, the GA searches for the best 1, 2, 3, ..., 34-dimensional feature sets separately. This is done by running the GA 34 times, each run for a particular feature set size. As an example, for a feature vector size of 15, the GA was applied and resulted in a 15-feature vector with the highest F-Ratio value among all candidate 15-dimensional vectors. This procedure was applied for the other vector sizes, and finally the F-Ratio values of the optimum feature vectors were compared; these values are shown in Fig. 13.
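The selection scheme can be sketched as below. This is our simplified illustration: the fitness is the sum of per-feature F-Ratios of the selected subset (the paper uses the multivariate F-Ratio), and mutation is a bit flip rather than the Gaussian perturbation named above, which fits binary chromosomes more naturally.

```python
import numpy as np

rng = np.random.default_rng(0)

def subset_fitness(mask, x1, x2):
    """Simplified fitness: sum of per-feature F-Ratios of selected features."""
    mu1, mu2 = x1[:, mask].mean(0), x2[:, mask].mean(0)
    w = ((x1[:, mask] - mu1) ** 2).sum(0) + ((x2[:, mask] - mu2) ** 2).sum(0)
    return float(((mu1 - mu2) ** 2 / w).sum())

def ga_select(x1, x2, k, pop_size=50, gens=100, p_mut=0.05):
    """Search for the best k-feature binary mask with scattered crossover."""
    d = x1.shape[1]

    def random_mask():
        m = np.zeros(d, dtype=bool)
        m[rng.choice(d, k, replace=False)] = True
        return m

    def repair(m):  # keep exactly k active genes after crossover/mutation
        on, off = np.flatnonzero(m), np.flatnonzero(~m)
        if len(on) > k:
            m[rng.choice(on, len(on) - k, replace=False)] = False
        elif len(on) < k:
            m[rng.choice(off, k - len(on), replace=False)] = True
        return m

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(gens):
        scores = np.array([subset_fitness(m, x1, x2) for m in population])
        elite = [population[i].copy() for i in np.argsort(scores)[-2:]]
        children = elite                     # two individuals always survive
        while len(children) < pop_size:
            pa, pb = rng.choice(pop_size, 2, replace=False)
            template = rng.random(d) < 0.5   # scattered crossover
            child = np.where(template, population[pa], population[pb])
            child ^= rng.random(d) < p_mut   # bit-flip mutation
            children.append(repair(child))
        population = children
    scores = np.array([subset_fitness(m, x1, x2) for m in population])
    return population[int(np.argmax(scores))]

# One run per subset size, e.g. best_mask_15 = ga_select(x1, x2, k=15).
```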

Fig. 12. F-Ratio values calculated individually for the 34 features extracted from the two-class audio segments.

Table 3
The exact F-Ratio values calculated individually for the 34 features extracted from the two-class audio segments.

Feature          F-Ratio
PMS-Petrosian    12.78
PIS-Petrosian    6.27
Mean-MFCC 1      6.16
Mean-MFCC 3      5.83
PMS-MCR          5.51
Diff-MFCC 2      4.67
Diff-MFCC 1      4.18
PMS-EDC          2.94
Diff-MFCC 3      2.68
Mean-MFCC 12     2.54
Diff-MFCC 4      2.39
Mean-MFCC 10     2.02
Diff-MFCC 5      1.77
PIS-MCR          1.35
Diff-MFCC 6      1.32
Diff-MFCC 7      0.99
Mean-MFCC 4      0.80
PIS-EDC          0.73
Mean-MFCC 6      0.71
Mean-MFCC 5      0.71
Diff-MFCC 8      0.65
Mean-MFCC 8      0.58
Diff-MFCC 9      0.45
Diff-MFCC 10     0.43
Mean-MFCC 13     0.33
Mean-MFCC 2      0.28
Diff-MFCC 11     0.28
Mean-MFCC 7      0.28
Diff-MFCC 12     0.27
Diff-MFCC 13     0.18
Mean-MFCC 9      0.14
Mean-MFCC 11     0.08


It can be seen that the F-Ratio value grows considerably over feature set sizes 1–15, after which its growth slows significantly. Since the feature vector should be kept as small as possible, this point has been chosen as the optimum for the problem at hand. Using 15 features (instead of 34) in the classification task reduces the computational complexity of the whole SAD system significantly and makes it faster and more time efficient. The best 15-dimensional feature set selected by the GA contains PMS-Petrosian, PMS-MCR, PMS-EDC, PIS-EDC, indices 2–5 of max-diff-MFCC and indices 1, 2, 3, 5, 7, 8 and 9 of mean-MFCC. These features have been extracted from the audio segments and used as inputs for the speech/non-speech classifier introduced in the next section.

6. SVM, the speech/non-speech classifier

In order to discriminate speech segments from non-speech ones, an SVM is employed, a well-known technique in the field of statistical learning theory (Burges, 1998; Vapnik, 1998). This technique has been widely used in different signal processing, pattern recognition and classification applications (Justino et al., 2005; Pal and Mather, 2004). Because of its good performance, especially on two-class discrimination problems, it has been selected as the classification method in this work. Briefly, assume that the training data with k samples is represented by $\{x_i, y_i\}$, $i = 1, \ldots, k$, where $x \in \mathbb{R}^n$ is an n-dimensional vector and $y \in \{-1, +1\}$ is the class label, as shown in Fig. 14. The aim is to find a hyperplane that separates the data with minimum error, which amounts to finding w and b such that

$$y_i(w \cdot x_i + b) + \xi_i \ge 1 \qquad (18)$$

where $\xi_i$ is the error value for the ith sample of the training dataset. The optimal separating hyperplane is achieved by minimization of the following criterion.

$$\min_{w,\,b,\,\xi_i} \left[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{k} \xi_i \right] \qquad (19)$$

The first term in this criterion sets the learning capacity, while the second term controls the number of misclassified points. The regularization constant (C > 0) determines the trade-off between the empirical error and the complexity term. The parameter C is chosen intuitively by the user; a large value of C means a higher penalty on errors.

When no linear hyperplane can separate the training data, the technique can be extended to nonlinear decision surfaces. Such a procedure can be viewed as mapping the input data to a higher-dimensional feature space using some nonlinear functions. This transformation is achieved by a kernel and allows points that are not linearly separable in the input space to be separated linearly in the feature space. Several kernel functions have been introduced in the literature, such as polynomial, Radial Basis Function (RBF), Multi-Layer Perceptron (MLP), spline and Fourier series kernels. In this work, two kinds of kernels, linear and polynomial, are used to classify the speech and non-speech segments. The polynomial kernel of the SVM is defined as follows.

$$K(x, x') = \left(\text{Gamma}\,\langle x, x' \rangle + \text{Coeff}\right)^{d} \qquad (20)$$

where "Gamma" and "Coeff" are two constants and d is the polynomial degree. Through this kernel, the input vectors x and x′ are implicitly mapped into a higher-dimensional feature space in which the two classes can be separated linearly.
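As a concrete instance, scikit-learn's polynomial kernel, $(\gamma\langle x, x'\rangle + c_0)^d$, matches Eq. (20) with Gamma = gamma, Coeff = coef0 and d = degree. The sketch below is ours; the paper does not name an SVM implementation, and C = 50 is the value reported for the linear SVM in Section 7.2, reused here as an assumption.

```python
from sklearn.svm import SVC

# Polynomial-kernel SVM with the parameter values reported in Section 7.2:
# d = 4, Gamma = 2, Coeff = 25 (and C = 50, taken from the linear-SVM setup).
clf = SVC(kernel="poly", degree=4, gamma=2.0, coef0=25.0, C=50.0)

# X_train: (n_segments, 15) optimized feature vectors from Section 5.2.2;
# y_train: +1 for speech, -1 for non-speech.
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```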

7. Experimental results

7.1. Experiments on the segmentation block stage

Experiments have been conducted on Farsi and English speech signals separately. The Farsi test set consists of about 30% of the FARSDAT speech corpus described in Section 3.1. Two criteria, tHR0 and tHR1 (described in Section 3.4), are used to evaluate the performance of the segmentation block. For the clean test data, these values are 96.64% and 93.11%, respectively. The test is repeated for test signals with additive noise at different SNR levels; the results are given in Table 4.

Similar experiments, using the system trained only on the Farsi database, were conducted on English utterances. The specification of the English test set was described in Section 3.1; the core test set of 24 TIMIT speakers, in clean and noisy conditions, has been used. The results are given in Table 5.

7.2. Experiments on the speech/non-speech classifier stage

When an audio signal is fed into the first stage (segmentation block) of the SAD system, the output is a number of audio segments which must then be classified into speech/non-speech. For the training process, 1325 speech segments extracted from the FARSDAT database, together with 667 non-speech segments of the different kinds described in Section 3.2, have been used. The test set consists of 537 speech and 282 non-speech segments (without any overlap with the train set). The 15-dimensional feature vectors, based on the optimized feature set discussed in Section 5.2.2, are extracted for the segments.

Fig. 13. The best F-Ratio values obtained by the GA for different feature set dimensions.

Fig. 14. The SVM aims to find a hyperplane that separates the data classes with minimum error.

For the linear SVM, the constant C is selected experimentally as 50. To find a proper value for this constant, we examined the HR0 and HR1 values over different values of C. As C increased from 1 to 50, HR0 and HR1 increased, but they decreased for C values above 50 (possibly due to overfitting).

First, the SVM classifier was trained with feature vectors extracted from the clean segments. Applying this classifier to the clean test set yields HR0 and HR1 values of 100% and 99.92%, respectively. This performance degrades when background noise is added to the test set: for additive white noise at SNR values of 10, 5 and 0 dB, HR1 drops to 66%, 55% and 47%, respectively, while no considerable change in HR0 is observed.

In order to increase the efficiency of the classifier in the presence of background noise, the linear SVM is trained with a noisy train set. For this purpose, different noise types, including white, pink, babble and factory noise, are added to the training dataset at SNR values of 10, 5 and 0 dB. This noisy database is then added to the clean train dataset, producing a new mixed train set. After retraining the SVM classifier, new test results were obtained for HR0 and HR1, which are given in Table 6.

To examine a more complex SVM on this task, an SVM with a polynomial kernel is trained on the mixed noisy train set. The polynomial parameters d, Gamma and Coeff (in Eq. (20)) are experimentally set to 4, 2 and 25, respectively. The test results are given in Table 7.

To show the language independency of the approach, the SVM classifier (trained only on the Farsi train set) is examined on an English dataset consisting of 2541 speech segments from the TIMIT database (taken from the 24-speaker core test discussed in Section 3.1), plus 950 non-speech segments (from those introduced in Section 3.2). The results obtained are given in Table 8.

7.3. Speech recognition experiments

To evaluate the performance of the proposed SAD in a speech recognition task, we have measured the WRR of an ASR system with and without the proposed SAD. For this purpose, the HTK toolbox is employed, configured as an HMM-based ASR system. The baseline system does not use any word language model. The feature set used in the ASR system consists of 39 MFCC-based parameters (MFCCs plus their delta and delta-delta coefficients). Six-state, 16-mixture, left-to-right, diagonal-covariance HMMs are employed to model the phonemes, plus a 3-state silence model and a single-state short pause model. The frame rate is 10 ms, with a frame length of about 23 ms. The test set is the 24-speaker core test of TIMIT, to which 950 non-speech segments are added within the silence intervals. Different stationary noises can also be added to this mixed signal to obtain more generally distorted test conditions. Table 9 presents the WRR values obtained by applying the clean and some noisy versions of the test set to the ASR system.

As shown in Table 9, there is about a 5–9% improvement in speech recognition accuracy in all cases. For the clean test set, the WER decreases from 32% to 22.8%, a relative WER reduction of as high as 28%.

Table 4
tHR0 and tHR1 calculated for the segmentation block of the SAD system, applied to the Farsi test signal, for different levels of additive noise.

Noise type   | SNR (dB) | tHR0 (%) | tHR1 (%)
Clean signal | –        | 96.64    | 93.11
White        | 10       | 91.32    | 94.67
White        | 5        | 85.20    | 93.89
White        | 0        | 76.47    | 90.54
Pink         | 10       | 87.20    | 93.60
Pink         | 5        | 81.37    | 91.52
Pink         | 0        | 73.99    | 88.31
Babble       | 10       | 84.03    | 93.83
Babble       | 5        | 77.81    | 79.79
Babble       | 0        | 74.28    | 68.20
Factory      | 10       | 87.80    | 94.14
Factory      | 5        | 78.75    | 93.22
Factory      | 0        | 74.45    | 83.86

Table 5
tHR0 and tHR1 calculated for the segmentation block of the SAD system, applied to English utterances, for different levels of additive noise.

Noise type   | SNR (dB) | tHR0 (%) | tHR1 (%)
Clean signal | –        | 94       | 96
White        | 10       | 91       | 93.2
White        | 5        | 89.2     | 88.3
White        | 0        | 81.4     | 83
Pink         | 10       | 92       | 93.4
Pink         | 5        | 88.2     | 83
Pink         | 0        | 82.4     | 79.8
Babble       | 10       | 91       | 92.4
Babble       | 5        | 86.1     | 86.3
Babble       | 0        | 77.3     | 80.4
Factory      | 10       | 90       | 92
Factory      | 5        | 84       | 84.1
Factory      | 0        | 73.4     | 77.3

Table 6
HR0 and HR1 for the linear SVM classifier (train and test on Farsi).

Noise type   | SNR (dB) | HR0 (%) | HR1 (%)
Clean signal | –        | 99.47   | 95.39
White        | 10       | 95      | 86
White        | 5        | 94      | 83
White        | 0        | 91      | 83
Pink         | 10       | 96      | 84
Pink         | 5        | 96      | 77
Pink         | 0        | 96      | 71
Babble       | 10       | 98      | 76
Babble       | 5        | 97      | 70
Babble       | 0        | 97      | 62
Factory      | 10       | 97      | 90
Factory      | 5        | 97      | 83
Factory      | 0        | 97      | 83

Table 7
HR0 and HR1 for the SVM classifier with polynomial kernel (train and test on Farsi).

Noise type   | SNR (dB) | HR0 (%) | HR1 (%)
Clean signal | –        | 99.47   | 95.39
White        | 10       | 95      | 87
White        | 5        | 94      | 86
White        | 0        | 92      | 83
Pink         | 10       | 96      | 87
Pink         | 5        | 95      | 84
Pink         | 0        | 93      | 79
Babble       | 10       | 95      | 79
Babble       | 5        | 96      | 77
Babble       | 0        | 95      | 74
Factory      | 10       | 96      | 91
Factory      | 5        | 97      | 90
Factory      | 0        | 97      | 90


8. Discussions

In this paper, a two-stage SAD has been introduced and shown to be a largely language-independent technique. As mentioned in the previous sections, a Farsi speech database was used to extract speech segments for determining the system parameters and training the classifiers. The first stage of the system is a segmentation block that rejects those parts of the signal which can easily be identified as non-speech, based on the energy and duration of the audio segments. Experiments conducted on English speech signals (Table 5) demonstrate the language independency of the system: the first row of Table 5 shows segmentation results almost identical to the first row of Table 4 for Farsi, and the same holds for the noisy conditions. The outputs of the segmentation block are audio segments which must be classified as speech or non-speech. This final classification is done by an SVM classifier fed with an optimal subset of the feature vector; the classifier was trained on features extracted from Farsi speech and from non-speech segments of various kinds. The optimal reduced feature set consists of features such as PMS-Petrosian and some MFCC-based features. Of course, there are many common phones and sub-phones in Farsi and English, and the two languages are not inherently different in terms of phonology. The comparison of the results in Tables 7 (for Farsi) and 8 (for English) shows the efficiency of the system for both languages, so we can claim that the overall developed system is language independent to some extent and could be used for other languages as well. The final test in Section 7.3, conducted on English utterances, shows excellent results and demonstrates that the proposed SAD can boost the performance of an ASR system.

In SAD systems, performance degradation in the presence of background noise is an unresolved problem. To make the proposed SAD system robust against background noise, an adaptive energy threshold is employed in the first stage of the system; Tables 4 and 5 show the acceptable performance of the segmentation block in the presence of background noise. In the second stage (the speech/non-speech classifier), some noisy speech segments were added to the original training set to make the classifier more robust against noisy signals. Tables 6–8 show the corresponding results of noisy speech/non-speech segment classification over Farsi. They show that under high-energy background noise, an SVM classifier with a more complex kernel, such as the polynomial kernel, is more robust than the linear SVM, whereas at high SNRs the linear SVM, with its lower complexity, still achieves acceptable performance.
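The kernel trade-off noted above is easy to reproduce with any standard SVM library. The sketch below uses scikit-learn (our choice of tool, not one named in the paper) and random placeholder vectors standing in for the 15-dimensional optimal feature set; with real features, the polynomial kernel's advantage should appear mainly on the noisier material.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for 15-D feature vectors of non-speech (0)
# and speech (1) segments; replace with features extracted from audio.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 15)),
               rng.normal(1.5, 1.0, (500, 15))])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear SVM: lower complexity, adequate at high SNR.
linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
# Polynomial-kernel SVM: more flexible boundary, more robust in heavy noise.
poly_svm = SVC(kernel="poly", degree=3).fit(X_tr, y_tr)

print("linear accuracy:", linear_svm.score(X_te, y_te))
print("poly   accuracy:", poly_svm.score(X_te, y_te))
```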

9. Conclusions

In this research, a two-stage speech activity detector consisting of two cascaded systems is proposed. First, a segmentation block based on a finite state machine (FSM), operating on the energy and duration of the audio segments, rejects the pauses in the input signal and passes speech and non-speech audio segments to the next stage. This preprocessing unit benefits from an adaptive energy threshold to overcome background noises which vary slowly in time. The second stage consists of an SVM classifier which uses a feature vector computed for every segment to decide whether or not the segment is speech. An extensive investigation has been conducted on two new fractal features, which are shown to be excellent discriminators for speech/non-speech classification. A GA is employed to reduce the feature set from 34 to 15 features, lowering complexity and redundancy. Finally, experiments on Farsi and English test sets have shown that our approach is, to some extent, language independent. The robustness of the system against background noise was shown to be acceptable. The proposed SAD was employed in conjunction with an English HMM-based ASR system, resulting in a relative WER reduction of 28%.
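As an illustration of the first stage, the following minimal sketch implements a frame-energy gate with a slowly adapting noise-floor estimate. The frame length, update rate alpha, and threshold margin are illustrative assumptions of ours, and the sketch omits the FSM's duration constraints; it is not the exact implementation used in this work.

```python
import numpy as np

def energy_gate(signal, sr, frame_ms=20.0, alpha=0.99, margin=3.0):
    """Frame-level pause rejection with an adaptive energy threshold.

    The noise-floor estimate is updated recursively on frames judged
    to be pauses, so the threshold tracks background noise that varies
    slowly in time. Returns a boolean mask (True = pass to stage two).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_e = np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

    noise_floor = np.min(log_e[:10])   # bootstrap from the earliest frames
    keep = np.zeros(n_frames, dtype=bool)
    for i, e in enumerate(log_e):
        if e > noise_floor + np.log10(margin):
            keep[i] = True             # candidate speech/non-speech audio
        else:
            # Pause frame: let the noise-floor estimate drift toward it.
            noise_floor = alpha * noise_floor + (1 - alpha) * e
    return keep
```

A duration-aware FSM, as used in the paper's first stage, would then merge adjacent retained frames into segments and discard segments shorter than a minimum duration.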


Table 8
HR0 and HR1 results for the SVM classifier with polynomial kernel (trained on Farsi, tested on English).

Noise type     SNR (dB)   HR0 (%)   HR1 (%)
Clean signal   –          99.6      95.3
White          10         96.8      88.2
White          5          93.6      85.3
White          0          92.4      84.5
Pink           10         95.7      86
Pink           5          93.6      85.9
Pink           0          94        81.3
Babble         10         92.2      78.2
Babble         5          93        75.7
Babble         0          91.7      75
Factory        10         98.2      92
Factory        5          96        89.1
Factory        0          93.5      86.9

Table 9
WRR values using the SAD at the input of the ASR system.

Signal           WRR with SAD (%)   WRR without SAD (%)
Clean            77.2               68
White-10 dB      71.4               63.2
Pink-10 dB       66.8               60.4
Babble-10 dB     65.6               58.7
Factory-10 dB    64.3               59.5
Volvo-10 dB      66.3               58.9

