
Voice Activity Detection and Noise Estimation for Teleconference Phones

Björn Eliasson

June 20, 2015

Student Master’s Thesis, 30 Credits Department of Mathematics and Mathematical Statistics


Copyright © Björn Eliasson. All rights reserved.

VOICE ACTIVITY DETECTION AND NOISE ESTIMATION FOR TELECONFERENCE PHONES
Submitted in partial fulfillment of the requirement for the degree
Master of Science in Industrial Engineering and Management
Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden

Supervisors:
Jun Yu, Umeå University
Nils Östlund, Konftel AB

Examiner:
Patrik Rydén, Umeå University


Abstract

When communicating via a teleconference phone, the desired transmitted signal (speech) needs to be crystal clear so that all participants experience good communication. However, many environmental conditions contaminate the signal with background noise, i.e. sounds not of interest for communication purposes, which impedes the ability to communicate. Noise can be removed from the signal if it is known, and so this work has evaluated different ways of estimating the characteristics of the background noise. Focus was put on using speech detection to define the noise, i.e. the non-speech part of the signal, but other methods not solely reliant on speech detection, relying instead on characteristics of the noisy speech signal, were also included. The implemented techniques were compared and evaluated against the current solution utilized by the teleconference phone in two ways: first for their speech detection ability, and second for their ability to correctly estimate the noise characteristics. The evaluation process was based on simulations of the methods' performance in various noise conditions, ranging from harsh to mild environments. The proposed method showed improvement over the existing solution, as implemented in this study, in terms of speech detection ability, and for the noise estimate it showed improvement in certain conditions. It was also concluded that using the proposed method would enable two sources of noise estimation compared to the current single estimation source, and it was suggested to investigate how utilizing two noise estimators could affect the performance.

Keywords: Voice Activity Detection (VAD), noise estimation, continuous noise estimation (CNE), statistical model-based VAD, improved minima-controlled recursive average (IMCRA), Rangachari noise estimation (RNE or MCRA-2), likelihood ratio approach, signal-to-noise ratio dependent recursive average, teleconferencing


Sammanfattning

When communicating via a conference phone, the transmitted signal (speech) must be sufficiently clear for all parties to experience good communication. In practice there are many environmental factors that contaminate the signal with background noise, i.e. sounds that are not of interest from a communication perspective, and that impair communication because of the interfering sounds. Background noise can be reduced in the transmitted signal if its characteristics are known, and therefore different methods of estimating the characteristics of the background noise have been evaluated. Focus was placed on using speech detection to define the noise, i.e. the signal without speech, but other methods that exploit the characteristics of the noisy signal were also included. The implemented methods were compared with and evaluated against the current noise estimation solution in two ways: first for their ability to correctly detect speech, and second for their ability to correctly characterize the noise. The evaluation process was based on a simulation study of the methods in a number of different noise environments, spanning the range from mild to very harsh conditions. It was shown that the proposed method gave an improvement compared to the existing solution, as implemented in this study, in terms of speech detection, and in terms of noise estimation in certain conditions. Furthermore, the proposed method gives access to two sources of noise estimation, in contrast to the current solution which has one. It was suggested as further work to study how these two sources can be combined.


Acknowledgements

I would like to thank my extraordinary supervisor Professor Jun Yu at Umeå University for all the time spent helping me out, be it encouraging words, lending of expertise or a thorough report review. Moreover, I direct a special thanks to my Konftel supervisor Dr. Nils Östlund for providing insight and keeping me on track throughout the project. Also, thank you to the members of project group Frost for letting me partake in the daily work at Konftel; it has been a great learning experience. Lastly, I would like to express my sincerest thanks to everyone at Konftel for being so incredibly nice to me this semester. It made me feel right at home!


Contents

1 Introduction
    1.1 Background
        1.1.1 Noise and Speech
        1.1.2 Noise Estimation
    1.2 Aim
    1.3 Scope and Limitations
    1.4 Outline

2 Theory
    2.1 Acoustic Theory
        2.1.1 Sinusoidal and Complex Waves
        2.1.2 Decibel
    2.2 Spectral Analysis
        2.2.1 Representing an Analogue Signal
        2.2.2 Discrete-Time Fourier Transform
        2.2.3 Discrete Fourier Transform
        2.2.4 Power Spectrum
        2.2.5 Frame Processing
    2.3 Additional Theory
        2.3.1 Loss, Risk and Bayes Risk Function
        2.3.2 Likelihood Ratio Test
    2.4 Markov Models
        2.4.1 Maximum Likelihood for the HMM
        2.4.2 Forward-Backward Algorithm

3 Method
    3.1 VAD Methods
        3.1.1 Aurora
        3.1.2 ETSI
        3.1.3 Statistical Model-Based VAD
    3.2 CNE Methods
        3.2.1 Likelihood Ratio Approach
        3.2.2 Improved Minima-Controlled Recursive Averaging
        3.2.3 Rangachari Noise Estimation
        3.2.4 SNR Dependent Recursive Averaging
    3.3 Methods of Comparison
        3.3.1 Comparison of VAD Methods
        3.3.2 Comparison of Noise Estimation
    3.4 Implementation
        3.4.1 Test Signals
        3.4.2 Implementation of VAD Comparison
        3.4.3 Implementation of Noise Estimation Comparison

4 Results
    4.1 Voice Activity Detection
        4.1.1 Cafeteria Noise
        4.1.2 Street Noise
        4.1.3 White Noise
    4.2 Noise Estimation
        4.2.1 Cafeteria Noise
        4.2.2 Street Noise
        4.2.3 White Noise
    4.3 Comfort Noise

5 Discussion and Conclusion
    5.1 VAD Evaluation
        5.1.1 Statistical Model-Based VAD
        5.1.2 ETSI
        5.1.3 Aurora
    5.2 Noise Estimation Evaluation
        5.2.1 Continuous Noise Estimation
        5.2.2 ETSI
        5.2.3 Aurora
    5.3 VAD vs. CNE
    5.4 Critique
        5.4.1 Evaluation Format
        5.4.2 Model Assumptions
    5.5 Conclusion
    5.6 Further Studies


Abbreviations

CNE      Continuous Noise Estimation
DFT      Discrete Fourier Transform
DTFT     Discrete-Time Fourier Transform
DD       Decision-Directed
ETSI     European Telecommunications Standards Institute
FFT      Fast Fourier Transform
HMM      Hidden Markov Model
IMCRA    Improved Minima-Controlled Recursive Average
LRA      Likelihood Ratio Approach
LRT      Likelihood Ratio Test
MedSE    Median Squared Error
MMSE     Minimum Mean Square Error
MSE      Mean Squared Error
PSD      Power Spectral Density
ROC      Receiver Operating Characteristics
SNR      Signal-to-Noise-Ratio
SNRDRA   Signal-to-Noise-Ratio Dependent Recursive Average
SMVAD    Statistical Model-Based Voice Activity Detector
VAD      Voice Activity Detector


Chapter 1

Introduction

This chapter introduces the background together with the aim, scope and limitations for this work.

1.1 Background

Being a global contender in the teleconference solution scene is no easy feat and requires good products. An example of a product offered by Konftel can be seen in Figure 1.1. In a teleconference setting good products equal high quality audio, which is imperative for the users to be able to communicate properly and with ease. One necessity in creating good audio is a good background noise estimate. An estimate of the background noise, or noise, is important for many aspects of generating high quality audio. It is needed for reducing the noise transmitted to the other participants in the teleconference, and it plays a part in cancelling out the echo created when the microphone picks up the loudspeaker signal. One way to estimate the noise is via a voice activity detector (VAD), which as the name suggests tries to detect the presence of speech. Knowing when someone is speaking is a tool to estimate the noise, i.e. the sound of non-speech. Other methods rely on key characteristics of speech to update the noise estimate; these methods are herein called continuous noise estimators (CNE). The basics of both strategies will be explained in the following subsections. Not only must an efficient noise estimator be able to represent the background noise well, it must also be able to do so in real time with minimum delay. The implementation and evaluation of noise estimation methods is the subject of this work.

Figure 1.1: Teleconference phone Konftel 300IP.

1.1.1 Noise and Speech

As this work is concerned with estimating the background noise it is reasonable to explain what noise is, even though it could appear trivial. Noise is everywhere around us and constitutes what everyone perceives as sounds not of interest, which in the case of teleconferencing translates to all sounds that do not originate from a participant speaking. In essence, this means that every signal can be decomposed into two parts, the speech signal and the noise signal. Together they form the noisy speech signal. As an example of noise, in a quiet room one's perception would often be that there is complete silence; however, this is not true, as there is always a noise background as long as you are in a medium where sound can exist. The noise background is more noticeable in other environments, such as a busy street, the office or a restaurant, because there the sounds coming towards you are more prominent. There are two types of noise, stationary and non-stationary noise. For stationary noise the noise characteristics do not change over time (e.g. a fan), and for non-stationary noise they do change over time (e.g. an accelerating car or people talking next door). Intuitively it would seem that stationary noise is easier to estimate than its counterpart. Not surprisingly, this is very much the case, as a method trying to estimate ever-changing noise characteristics must be able to adapt to changes constantly, whereas to estimate stationary noise you only need to characterize the noise once.

There are two main groups of speech sounds, voiced and unvoiced. When the vocal folds are tensed and air is pushed through, the resulting vibration produces voiced sounds such as vowels. Unvoiced sounds are produced when the vocal folds do not vibrate but tense up and come closer together, allowing the air stream to become turbulent. The 'h' in house is an example of an unvoiced sound. 's' and 't' are also unvoiced sounds, produced when the tongue and lips impose limitations on the vocal tract. Not surprisingly, different types of spoken sounds are more or less easy to detect in a noise background depending on how much they resemble noise [17].


1.1.2 Noise Estimation

There are two main strategies for estimating the noise. One strategy relies on a VAD decision and the other is based on a CNE scheme which utilizes some key characteristic of the noisy speech signal [17]. But before the basics of these methods are discussed it would be good to get a better understanding of why a noise estimate is needed. The importance of a good noise estimation procedure is made clear through its use in speech enhancement, including noise suppression.

Noise Suppression and Other Uses

The importance of noise suppression is explained in [13] with the teleconference setting as an example. In a conference call the background noise for each participant is picked up and additively combined at the network bridge. This means that each of the loudspeakers will reproduce the combined sum of the background noises from the other participants. As the number of participants increases, the combined background noise will overpower the desired signal, making communication impossible. This makes it clear that the noise must be attenuated without affecting the speech, which is much of the issue when dealing with noise suppression. A noise suppression system will remove the estimated noise from the noisy speech signal, and the resulting signal will hopefully contain speech only. This problem of removing the estimated noise from the noisy signal shows a difference between overestimating and underestimating the noise. Overestimation can cause speech distortion, as too much of the noisy speech signal is removed, even speech. Underestimation of the noise can lead to the background noise still being present in the noise-suppressed signal. In practice this is not exactly how it works, but it still gives an idea of the difference between overestimating and underestimating the noise.

While noise suppression is a very important aspect, and the main focus of this work, it is not the only thing the noise estimate is used for. In communication devices it is common to use comfort noise, which is simply an estimated noise background that is transmitted to the far-end user as an assurance that the connection is still working. Comfort noise may also be used to cover some residual echo, making it less audible. The process of generating the transmitted noise ranges from using static colored noise (noise with more power in some part of the frequency range) to adaptive schemes trying to emulate changes in the noise process [12], and it is for these latter methods that an up-to-date noise estimate is needed. A noise estimate is also needed for echo cancellation, i.e. the practice of removing the echo created when the teleconference phone's microphone picks up its own loudspeaker signal, as well as for double-talk detection, which tries to detect when two parties of the teleconference speak simultaneously.


Voice Activity Detection

The idea behind a VAD is simple and can be summarized in three steps. First, a signal feature is extracted. Second, a decision rule is employed, deciding whether these features are those of speech or of noise. Third, it is common to use some kind of decision alteration, which is usually more empirical in nature and tuned to specific needs [22]. When a VAD is used for noise estimation it estimates the noise during noise periods only, i.e. when no speech is detected it treats the entire incoming signal as noise.

Several different signal features and decision rules have been suggested for speech detection throughout the years, as there is no single all-defining feature that captures all the complexity of speech. In an overview by Ramírez in [22] the most common features used for a VAD are explained. Tracking the energy of the signal is a useful and intuitively simple method, as it can be assumed that speech contains more energy than noise. Here the presence of speech is assumed when the signal energy is greater than some threshold. These energy-based VADs are used both in the time and the frequency domain (see section 2.1 and section 2.2). As an addition to the energy-based thresholding scheme, some methods use frequency analysis tools based on tracking the minimum and maximum energies of the low and high frequencies. Assuming an initial noise period, the energy envelope can be tracked and compared to incoming energy values using a simple difference measure, which in turn is used in the VAD decision logic. Other methods assume that there are inherent differences between speech and noise in terms of the periodicity of the signal, frequency or pitch (see section 2.1). Pitch is a non-linear function of frequency [9].
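As an illustration of the energy-thresholding idea described above, the following is a minimal sketch and not one of the methods evaluated in this work; the frame length, smoothing constant and threshold are arbitrary illustrative choices.

```python
import numpy as np

def energy_vad(x, frame_len=256, threshold_db=6.0, noise_init_frames=10):
    """Toy energy-threshold VAD: a frame is marked as speech when its energy
    exceeds a running noise-energy estimate by more than threshold_db dB."""
    n_frames = len(x) // frame_len
    decisions = np.zeros(n_frames, dtype=bool)
    noise_energy = None
    for m in range(n_frames):
        frame = x[m * frame_len:(m + 1) * frame_len]
        energy = np.sum(frame ** 2) + 1e-12                  # avoid log(0)
        if m < noise_init_frames:                            # assume an initial noise-only period
            noise_energy = energy if noise_energy is None else 0.9 * noise_energy + 0.1 * energy
            continue
        if 10 * np.log10(energy / noise_energy) > threshold_db:
            decisions[m] = True                               # speech detected
        else:
            noise_energy = 0.9 * noise_energy + 0.1 * energy  # track the noise level in noise frames
    return decisions
```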

While these VADs are heuristic in nature, other researchers have focused on developing statistical models for speech detection. In [26] a VAD was introduced that modeled the speech and noise as independent Gaussian processes. This method was later improved by adding contextual information to the decision rule; however, this addition led to the methods no longer being causal [23]. Other methods use statistical models based on Laplacian or Gamma distributions [3].

Continuous Noise Estimation

The idea of the CNE schemes is to utilize some key characteristics of speech and the noisy speech signal to be able to constantly update the noise estimate regardless of speech presence. There are three main classes of CNE algorithms: the time-recursive averaging algorithms, the minimum tracking algorithms and the histogram-based algorithms, each based on one or more of three key characteristics of speech and the noisy speech signal. The first characteristic is the fact that "silent" portions of speech do not only occur in noise periods when looking at a frequency band (a subset of all frequencies, e.g. all frequencies between 0 and 8000 Hz). As an example, a low-frequency vowel will affect the lower part of the frequencies, enabling estimation of the noise in the higher frequencies.


Figure 1.2: Visualization of the three key characteristics of speech and the noisy speech signal utilized by the CNE. In (a) speech is shown to exist only in the lower part of the frequency band (assuming speech has more energy than noise), allowing estimation of the noise in the upper frequencies where the energy level is low and thus assumed to be noise only. (b) shows the idea behind the minimum statistics algorithms: the power of the speech signal (red) decays to the power of the noise signal (blue) between utterances. In (c) the histogram of the logarithmic energy level of the noisy speech signal shows the most common energy level, taken to be the background noise power level.

In short, it is possible to update the noise estimate for every frequency not containing speech; the time-recursive averaging algorithms exploit this. The second characteristic is that the power of the noisy speech signal often decays to the power of the noise, so remembering the lowest observed energy level provides a tool to estimate the background noise. This is the idea behind the minimum tracking algorithms. The third characteristic is that a histogram of the energy values for every frequency reveals the most common energy level, taken to be the noise energy level. This assumption leads to the histogram-based noise estimation schemes [17]. Figure 1.2 shows a visualization of the three mentioned characteristics of speech and the noisy speech signal utilized by the CNE.
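To make the time-recursive averaging idea concrete, the sketch below updates a per-bin noise PSD estimate from the current periodogram; the speech presence probability and the smoothing constant are placeholders for whatever a concrete CNE method (such as those introduced in Chapter 3) would compute, not the update rule of any specific algorithm.

```python
import numpy as np

def recursive_noise_update(noise_psd, noisy_psd, speech_prob, alpha=0.85):
    """One frame of time-recursive averaging: bins judged unlikely to contain
    speech pull the noise estimate towards the current noisy periodogram.

    noise_psd   -- previous noise PSD estimate, shape (K,)
    noisy_psd   -- periodogram |X[k]|^2 of the current frame, shape (K,)
    speech_prob -- per-bin speech presence probability in [0, 1], shape (K,)
    """
    # Effective smoothing: bins with high speech probability keep the old estimate.
    alpha_eff = alpha + (1.0 - alpha) * speech_prob
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * noisy_psd
```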

The reasoning behind having a constant update is to have a more recent noise estimate, compared to a VAD-based noise estimate, when long speech segments are present (or assumed to be present). This is especially important in the case of non-stationary noise (e.g. noise from inside a cafeteria), where a VAD might indicate a long speech period while the background noise changes, making the noise estimate outdated.


The VAD might even be performing well, i.e. it correctly classifies the long period as speech, but this does not change the fact that the noise estimate has not been updated for a long period. In the case of stationary noise a continuous update scheme loses its purpose, as a few noise estimates would be sufficient to fully characterize the noise. Salas in [25] presents a good overview of different types of CNE schemes from all three groups.

Difficulties in Estimating the Noise

The biggest difficulty facing anyone trying to characterize the noise is the fact that in practice only the noisy speech signal picked up by the microphone is available for analysis. In other words, the problem of noise estimation boils down to separating the speech signal from the noise signal with only the information available in the noisy speech signal. When trying to separate these two signals it is inevitable that the speech signal will affect the noise signal estimate and vice versa. The goal is to minimize these effects to allow for the most accurate estimate of each separate signal.

When using a VAD-based noise estimation scheme, this problem occurs when the incoming signal is classified as containing noise only while in reality it contains speech as well. Speech components will then be incorporated into the noise estimate, as the entire signal is treated as containing only noise while in reality it contains both noise and speech. To minimize these effects it is common to mark the signal as noise only when it is very certain to be noise only. However, this of course increases the chance of background noise being wrongly classified as speech. The CNE does not use the binary speech decision, so the problem of the two signals affecting each other's estimates takes another form, but it is still present.

Current Noise Estimation Solution

The background noise is currently estimated by the use of a VAD, as part of the Aurora audio algorithm employed in the teleconference phones, henceforth referred to simply as Aurora. As previously mentioned, relying on a VAD to characterize the background noise introduces a few possible weaknesses. The noise estimate will only be updated during periods marked as containing noise only, and this becomes a problem for Aurora in two ways. Firstly, the Aurora algorithm is very sensitive to non-stationary noise, meaning that in these noise conditions the VAD will be tricked into believing that there is speech present in the signal. This ultimately leads to the noise estimate becoming outdated when long segments of the signal are falsely deemed to contain speech. Secondly, a problem with the current method is its dependence on signal strength. With its current design it will interpret strong signals as speech, even though it might just be a strong noise signal, which in the end yields the same result as before: falsely marked long speech segments make the noise estimate outdated.


A real example of the signal strength being an issue is the practice of using strong white noise generators in office landscapes to create a more preferable noise background. These generators can emit quite loud sounds to mask the original background noise, and this strong signal will be interpreted as speech by the Aurora VAD. Both of these issues can be attributed to the empirical nature of the Aurora implementation. Having an empirically tuned VAD method makes it hard to know how it will perform in various conditions, and therefore it is important that the proposed method is not so empirical as to introduce new unforeseen problems. This is of course hard to verify, but can be taken into account by choosing a tried and tested method.

Possible Solutions

There are two possible solutions to these problems. The first would simply be to improve upon the VAD's speech detection ability to make it better at distinguishing between noise and speech. The other would be to implement a CNE that is not reliant on the binary decision of a VAD to update the noise estimate, and therefore possibly better at dealing with non-stationary noise.

1.2 Aim

To stay ahead in a competitive environment, product improvements are crucial. As the work of developing a new generation of teleconference phones is under way, the idea was to research alternatives to the current noise estimator. In order to improve upon the current noise estimator, both of the possible solutions discussed above will be evaluated.

The aim of this work was twofold. The first part was to implement and evaluate different VAD methods and compare these to the current VAD implemented in the teleconference phones in terms of speech detection ability. The second part was the implementation and evaluation of different methods for noise estimation only, i.e. trying to correctly characterize the noise.

1.3 Scope and Limitations

This work is limited to three VAD methods and four CNE methods, implemented and analysed. The performance of the chosen methods will be evaluated in various conditions associated with the teleconference setting and compared to the existing solution and to each other. These settings include different noise strengths combined with various noise environments. The speech and noise conditions used for the comparison will be sound files recommended by the International Telecommunication Union for use as test signals. The sound files used are a subset of the sound files discussed in the ITU-T P.501 standard [14].


The evaluation process will be limited to simulations using the aforementioned sound files; no real-time implementation in a conference telephone will be done.

There are physical as well as computational limitations to the platform in which the noise estimator is to be implemented. The proposed estimation schemes must take these limitations into consideration. This means that any proposed method must be able to handle a real-time implementation with minimal delay, so as not to interfere with the communication. The proposed methods cannot be too computationally complex. Another important issue is that the method must be causal. This is a restriction imposed on the system so as not to cause too much time delay when processing the signal.

1.4 Outline

This work is structured as follows: Chapter 2 introduces the relevant theory about acoustics, digital spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. Readers familiar with any of these subjects may skim through the corresponding parts to familiarize themselves with the notation used in this work. In Chapter 3 the chosen methods for VAD and CNE are introduced along with the performance evaluation format and implementation specifics. Chapter 4 presents the results, and finally in Chapter 5 the results are discussed, conclusions are drawn and further work is proposed.


Chapter 2

Theory

This chapter gives a short introduction to acoustic theory, spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. If any of these subjects are familiar the reader is suggested to skip or skim through parts of this section; however, it is recommended to familiarize yourself with the notation used.

2.1 Acoustic Theory

Sound can be defined in two ways, either as a physical wave propagating through any elastic medium, or as the excitation of our hearing mechanism resulting in the psychophysical perception of sound [9].

2.1.1 Sinusoidal and Complex Waves

The wave form that often describes sound, and various other kinds of signals, is the sinusoidal wave. To define a periodic sinusoidal wave the signal amplitude, frequency and phase are needed. The amplitude is the absolute value of the signal. The frequency is the number of complete periods per second, where one period is the time between two wave peaks, and is measured in Hertz (Hz). The phase is a shift along the time axis and indicates where the first zero crossing occurs. The sinusoidal wave can be expressed as

x(t) = A · cos(ωt + φ)   (2.1)

where t is a time unit (a signal expressed like this is called a time-domain signal, as it is a function of time), A is the signal amplitude, φ is the phase in radians and ω is the angular frequency in radians per time unit. ω can be used to derive the period of the wave as T = 2π/|ω|; the wavelength is the distance the wave travels in one period [21]. In the case of sound, the sinusoidal wave represents the degree of displacement (compression and rarefaction) of air particles in relation to the prevailing atmospheric pressure.


Figure 2.1: Sine wave in acoustical application

Figure 2.1 shows the basics of the sinusoidal wave in an acoustical application. The simple sinusoidal wave does not seem to be of much use for representing the complex wave of speech, since the wave shapes of speech look drastically different from the simple sinusoidal wave. However, no matter what shape a wave has, it can be reduced to sinusoidal wave components as long as it is periodic. This means that any periodic complex wave can be synthesized using sinusoidal waves of different amplitudes, frequencies and phases [9].

Expressing the sinusoidal signals in terms of complex exponentials makes some useful tools to analyze the signal available, e.g. the Fourier decomposition of signals discussed in the next section. By Euler's formula the sinusoidal wave relates to the complex exponential by

e^{iωt} = cos(ωt) + i sin(ωt)

where i denotes the imaginary unit. Hence the sinusoidal wave from equation (2.1) can be expressed in terms of complex exponentials as [21]

A · cos(ωt + φ) = (A/2) e^{iφ} e^{iωt} + (A/2) e^{−iφ} e^{−iωt}

2.1.2 Decibel

Any power level of magnitude W1 can be expressed in terms of a reference power W2 as

L1 = 10 · log10(W1/W2) decibels.

Magnitudes other than acoustic power can be expressed in dB. For example, acoustic power is proportional to the squared acoustic pressure, p, hence the power level is

Lp = 20 · log10(p1/p2) decibels.   (2.2)


These two equations define two useful relationships between power levels. As sound pressure is a common parameter to measure in acoustics, equation (2.2) is often used [9]. The decibel is important to the concept of signal-to-noise ratio (SNR), which is by definition the ratio of the signal power to the noise power, usually measured in dB.
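As a small worked example of the definition above, the SNR in dB can be computed from the average powers of separate speech and noise signals (a sketch, assuming the two signals are available separately, which in practice they are not):

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from a clean speech signal and a noise signal."""
    p_speech = np.mean(np.asarray(speech) ** 2)   # average speech power
    p_noise = np.mean(np.asarray(noise) ** 2)     # average noise power
    return 10 * np.log10(p_speech / p_noise)
```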

2.2 Spectral Analysis

There are several advantages to moving into the frequency, or spectral, domain when analysing signals. In the spectral domain the signal is expressed as a function of different frequencies, in contrast to the time domain where it is expressed as a function of time. First of all, the spectral domain yields a better separation of speech and noise, as these signals usually contain different frequency information; e.g. speech does not exist in the very high frequencies while noise can. Naturally, this makes it easier to implement an optimal or heuristic approach to VAD and CNE. Secondly, in the spectral domain the spectral components are decorrelated, which means that to some extent the frequency information can be treated independently, simplifying statistical models [1]. Before the spectral analysis can begin a few concepts need to be introduced.

2.2.1 Representing an Analogue signal

For a computer to analyze an analogue signal it must be able to convert it into a digital signal. To do this an analogue-to-digital converter is used, resulting in the process of sampling. Sampling an analogue signal means measuring it at discrete time intervals, and therefore the sampled signal is merely a discrete representation of the continuous signal. So given a signal x(t), where t denotes a continuous time variable, the sampling model replaces t with the discrete value nTs. The discrete value n is used to index an array, with the sampling period Ts as the time between each sample. The sampling process can be described as [30]

x[n] = x(nTs)

2.2.2 Discrete-Time Fourier Transform

To move into the frequency domain from the time domain the Discrete-Time Fourier Transform (DTFT) is used. Given a discrete-time signal x[n], the DTFT maps the signal to the linear combination of complex exponentials the signal consists of. The DTFT of x[n] is given by

X(e^{iω}) = Σ_{n=−∞}^{+∞} x[n] e^{−iωn}   (2.3)


The DTFT is invertible in the sense that given X(e^{iω}) the original signal may be restored by the inverse DTFT (IDTFT), given by

x[n] = (1/2π) ∫_{2π} X(e^{iω}) e^{iωn} dω   (2.4)

These two equations are called the analysis equation (2.3) and the synthesis equation (2.4) and are used to move from the time domain to the frequency domain and vice versa [21]. In practice the DTFT of a noisy speech signal is generally a complex-valued function of e^{iω}, which can be expressed in polar form as X(e^{iω}) = |X(e^{iω})| e^{iφ(e^{iω})}, where |X(e^{iω})| is the magnitude spectrum and φ(e^{iω}) is the phase spectrum [17].

2.2.3 Discrete Fourier Transform

In the case of digital signal processing the DTFT is replaced by the Discrete Fourier Transform (DFT), because the DTFT is a function of a continuous variable e^{iω}, which is not compatible with digital computation. In practice the time signal x[n] consists of N samples and is therefore finite, contrary to equation (2.3), and so the DTFT can be sampled at N uniformly spaced intervals by using ω_k = 2πk/N, often referred to as frequency bins. Sampling the DTFT this way yields the DFT, given by

X[k] = Σ_{n=0}^{N−1} x[n] e^{−i2πnk/N}   (2.5)

As in the case of the DTFT, the DFT is invertible by

x[n] = (1/N) Σ_{k=0}^{N−1} X[k] e^{i2πnk/N}   (2.6)

Due to the computational complexity of the DFT, a Fast Fourier Transform (FFT) algorithm is instead employed to compute the DFT [30].

2.2.4 Power Spectrum

Most signals used in applications cannot be predicted exactly and can only be expressed by probabilistic statements. A random signal can be characterized by a power spectral density (PSD). The power spectral density is the frequency-domain specification of the second-order moment of the signal. To express the PSD the auto-covariance sequence of the stationary signal is needed, given by

r(k) = E[ x[n] x*[n − k] ]


where * denotes the conjugate and E[·] is the expectation operator. The PSD is the DTFT of the covariance sequence and thus calculated as

P(e^{iω}) = Σ_{k=−∞}^{∞} r(k) e^{−ikω}

The idea of spectral estimation is to estimate how the total signal power is distributed over frequency from finite discrete observations of a stationary process. There are both parametric and non-parametric techniques for estimating the power spectrum. In practice the non-parametric periodogram method is often used to estimate the PSD due to its simplicity. The periodogram can be computed as

P(k) = (1/N) |X[k]|²   (2.7)

where X[k] is the DFT of the data sequence x[n] and N is the length of the data sequence. This yields an efficient way of estimating the PSD with the help of the FFT [28].

2.2.5 Frame Processing

Even though speech is a non-stationary process, it can be assumed to be stationary for short periods of time, between 10 and 30 ms. This assumption is necessary for models employing the DTFT. Because of this the practice of frame processing is used. A frame is simply a short segment of the sampled signal that is processed individually. In terms of a VAD or CNE this means that for each processed frame you get a binary speech presence decision and a noise estimate. When dealing with frames it is also common to use a window function, which affects the PSD of the signal. The simplest window function is the rectangular window, which when used on a signal for the purposes of the DFT is identical to framing the signal and applying the DFT to each frame. The rectangular window is defined to be 1 inside the window and 0 outside. Another common window function is the Hamming window, which gives different weights to the signal samples, with more weight to the middle samples than to the edge samples. Using a Hamming window instead of a rectangular one makes it easier to spot differences between far-apart frequency bins, at the expense of making it harder to separate frequency bins close to each other. Statistically speaking, a Hamming window would, compared to a rectangular window, increase the correlation of nearby frequency bins and reduce the correlation of more distant frequency bins [6].

As using a Hamming window gives less weight to some samples, the frames overlap to make sure that in the end every sample receives equal weight. The amount of overlap between frames differs, but around 50% is commonly used. This means that when the signal processing step is complete the modified signal must be restored using the overlap-add method [17] and the IDFT from equation (2.6).
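The framing, windowing and overlap-add steps described above can be sketched as follows; the frame length of 512 samples, the 50% overlap and the pass-through processing step are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def frames_overlap_add(x, frame_len=512, hop=256, process=lambda f: f):
    """Split x into 50%-overlapping Hamming-windowed frames, apply `process`
    to each frame (e.g. FFT -> modify -> IFFT), and resynthesize with overlap-add."""
    win = np.hamming(frame_len)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))                         # accumulated window weight per sample
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win    # windowed analysis frame
        frame = process(frame)                      # per-frame processing step
        out[start:start + frame_len] += frame       # overlap-add synthesis
        norm[start:start + frame_len] += win
    return out / np.maximum(norm, 1e-12)            # undo the uneven window weighting
```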


2.3 Additional Theory

Given an observation vector x and a vector of target variables θ, the goal is to predict θ given a new set of observations. Hence, the goal is to determine the posterior probability density function (PDF) of θ, i.e. p(θ|x). With the help of Bayes' theorem this PDF can be expressed as

p(θ|x) = p(x|θ) p(θ) / p(x)   (2.8)

where p(x|θ) is the joint distribution of x for a given target variable, p(θ) is the prior target variable PDF and p(θ|x) the corresponding posterior density [2].

2.3.1 Loss, Risk and Bayes Risk function

When estimating any parameter it is in most cases important to introduce a disparity between types of estimation errors. For example, in the case of noise estimation through a VAD, marking speech as noise is worse than the other way around, as the speech would then affect the noise estimate. Also, when estimating the noise, a larger error should be penalized more than a small error, as the larger error will have a significantly greater effect in the noise suppression step. To do this a loss function is introduced. The loss function expresses the loss incurred for every error in the estimates. The loss function is usually denoted as

L = L(θ, θ̂)

where L corresponds to the loss incurred for the estimate θ̂ of θ. The idea is to choose the estimator that minimizes the loss. However, in practice θ is unknown and it is therefore hard to find the optimal estimator θ̂, so the aim is instead to minimize the expected loss, called the risk function, denoted by R(θ̂) = E[L] [2].

An estimator that relies on Bayes' rule from equation (2.8) is considered a Bayesian estimator and can be derived using a Bayesian risk function. The most important feature of including these risk functions is that they enable perceptual weighting in the estimators, i.e. they make it possible to include psychoacoustics, trying to emulate our hearing mechanism, in the estimator. The Bayesian risk function is the expectation of the risk function and is given by

ℛ = E[R(θ̂)] = ∫∫ L(θ, θ̂) p(x, θ) dx dθ   (2.9)

as the parameter θ is now a stochastic variable. Minimizing the Bayes risk function with respect to θ̂ for any given loss function yields a variety of estimators [17].
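As a standard textbook example of how a specific loss function turns the Bayes risk into a concrete estimator (not a derivation taken from this thesis), the squared-error loss L(θ, θ̂) = (θ − θ̂)² in equation (2.9) leads to the posterior mean, i.e. the MMSE estimator listed among the abbreviations:

```latex
\mathcal{R} = \iint (\theta - \hat{\theta})^2 \, p(\mathbf{x}, \theta)\, d\theta\, d\mathbf{x}
            = \int \Big[ \int (\theta - \hat{\theta})^2 \, p(\theta \mid \mathbf{x})\, d\theta \Big] p(\mathbf{x})\, d\mathbf{x}
% Setting the derivative of the inner integral with respect to \hat{\theta}
% to zero for each \mathbf{x} yields the posterior mean:
\hat{\theta}_{\mathrm{MMSE}} = E[\theta \mid \mathbf{x}] = \int \theta \, p(\theta \mid \mathbf{x})\, d\theta
```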


2.3.2 Likelihood Ratio Test

Given a random sample X_1, X_2, ..., X_n from the stochastic variable X with PDF p(X; θ), the likelihood function is defined as

ℓ(θ) := ∏_{i=1}^{n} p(X_i; θ).

The likelihood ratio test (LRT) is used for hypothesis testing given a set of observations and two hypotheses. The likelihood ratio is a measure of how much more likely one hypothesis is than the other. The LRT for an observation vector X conditioned on the two different hypotheses H0 and H1 can be defined as

Λ = ℓ(θ_{H1}) / ℓ(θ_{H0})   (2.10)

and the decision in favor of either hypothesis depends on a threshold determining the acceptable false alarm rate [18].
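To connect the LRT to the spectral models used later, the sketch below evaluates a per-frame log likelihood ratio under zero-mean complex Gaussian models for the DFT coefficients, in the spirit of the statistical model-based VAD of Section 3.1.3; treating the speech and noise variances as known inputs is an assumption made purely for illustration.

```python
import numpy as np

def frame_log_likelihood_ratio(noisy_dft, noise_var, speech_var):
    """Average log likelihood ratio for H1 (speech present) vs H0 (noise only)
    under zero-mean complex Gaussian models for each DFT bin.

    noisy_dft  -- DFT coefficients X[k] of the current frame
    noise_var  -- noise variance per bin, lambda_N[k]
    speech_var -- (assumed known) speech variance per bin, lambda_S[k]
    """
    gamma = np.abs(noisy_dft) ** 2 / noise_var       # a posteriori SNR per bin
    xi = speech_var / noise_var                      # a priori SNR per bin
    llr = gamma * xi / (1.0 + xi) - np.log1p(xi)     # log of p(X|H1)/p(X|H0) per bin
    return np.mean(llr)                              # frame-level decision statistic

# Decision rule: indicate speech if the returned statistic exceeds a chosen threshold.
```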

2.4 Markov Models

This section is based on [2]. When speech is very prominent in the noisy signal, the task of classifying between speech and non-speech is simple. However, this will not always be the case in practice. The detection of weak speech endings, especially unvoiced speech, is troublesome, as they often resemble noise. To reduce the risk of clipping the speech short one can model the correlative nature of speech occurrences in the LRT decision. To express this correlative behavior in a probabilistic manner a Markov model can be used. With the help of the product rule the joint distribution of a sequence of N observations can be expressed as

p(x_1, ..., x_N) = p(x_1) ∏_{n=2}^{N} p(x_n | x_1, ..., x_{n−1})

where x_1, ..., x_N are the observation vectors. If each conditional distribution is independent of all previous observations except the most recent one, the model becomes a first-order Markov chain. The joint distribution of the first-order Markov chain for N observations is

p(x_1, ..., x_N) = p(x_1) ∏_{n=2}^{N} p(x_n | x_{n−1}).

For every observation x_n a corresponding unobservable variable z_n is introduced.


Under the assumption that the Markov chain is formed by the unobservable variables, a so-called state space model is obtained. The joint distribution for this model is given by

p(x_1, ..., x_N, z_1, ..., z_N) = p(z_1) ∏_{n=2}^{N} p(z_n | z_{n−1}) ∏_{n=1}^{N} p(x_n | z_n).

If the unobservable variables z_n of the state space model are discrete, the hidden Markov model (HMM) is obtained. Let the probability distribution of z_n depend on the previous state of the unobservable variable z_{n−1} via the conditional distribution p(z_n | z_{n−1}). The elements of the unobservable variables are binary, meaning that the conditional distribution corresponds to the so-called transition probabilities, given by A_{jk} = p(z_{nk} = 1 | z_{n−1,j} = 1), where z_{nk} denotes the unobservable variable attaining state k. The transition probabilities are collected in the matrix A. As they are probabilities they satisfy 0 ≤ A_{jk} ≤ 1 and Σ_k A_{jk} = 1. The conditional distribution for K different states can be expressed as

p(z_n | z_{n−1}, A) = ∏_{k=1}^{K} ∏_{j=1}^{K} A_{jk}^{z_{n−1,j} z_{nk}}.   (2.11)

The initial unobservable variable z_1 cannot be defined as in (2.11) and is instead defined by a vector of probabilities π with elements π_k = p(z_{1k} = 1), so that

p(z_1 | π) = ∏_{k=1}^{K} π_k^{z_{1k}}

where Σ_k π_k = 1. To complete the HMM the conditional distributions of the observed variables, p(x_n | z_n, φ), need to be defined. These conditional distributions, with parameter set φ = [φ_1, ..., φ_K], are called emission probabilities. The emission distribution can for example be Gaussian, in which case φ represents the parameter set needed to define the Gaussian. The emission probabilities for K states can be represented as

p(x_n | z_n, φ) = ∏_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}.

A homogeneous HMM shares the parameters A between all of the conditional distributions of the unobservable variables, as well as φ between all of the conditional emission distributions. The joint distribution over both unobservable and observed variables is therefore given by

p(X, Z | θ) = p(z_1 | π) ∏_{n=2}^{N} p(z_n | z_{n−1}, A) ∏_{m=1}^{N} p(x_m | z_m, φ)   (2.12)

where X = [x_1, ..., x_N], Z = [z_1, ..., z_N] and θ = [π, A, φ] is the set of model parameters.


2.4.1 Maximum Likelihood for the HMM

Given observed data X, the parameters of the HMM can be estimated using maximum likelihood. The likelihood function is obtained from (2.12) by summing over the unobservable variables

p(X | θ) = Σ_Z p(X, Z | θ).

To efficiently maximize the likelihood function of an HMM the expectation maximization (EM) algorithm is used. The EM algorithm is initialized by a selection of the model parameters, denoted θ^old. The model parameters are often initialized randomly, subject to model constraints. In the first step of the algorithm (E) the model parameters are used to find the posterior distribution of the unobservable variables, p(Z | X, θ^old). This distribution is used to assess the expected value of the logarithm of the likelihood function of the complete data, as a function of the parameters θ, inserted into the function Q(θ, θ^old) given by

Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ).   (2.13)

γ(z_n) is introduced as the marginal posterior distribution of an unobservable variable z_n, and ξ(z_{n−1}, z_n) as the joint posterior distribution of two successive unobservable variables, such that

γ(z_n) = p(z_n | X, θ^old)

ξ(z_{n−1}, z_n) = p(z_{n−1}, z_n | X, θ^old).

γ(z_{nk}) is used to denote the conditional probability that z_{nk} = 1, and ξ(z_{n−1,j}, z_{nk}) is defined in a similar fashion. As the underlying indicator variables are binary, their expected values are just the probabilities that they take the value 1. Substitute the joint distribution from equation (2.12) into equation (2.13) and use the definitions of γ and ξ to obtain

Q(θ, θ^old) = Σ_{k=1}^{K} γ(z_{1k}) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(z_{n−1,j}, z_{nk}) ln A_{jk} + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k).

The expectation step of the EM algorithm is used to evaluate γ(z_{nk}) and ξ(z_{n−1,j}, z_{nk}). This can be done using the forward-backward algorithm. The maximization step treats γ(z_{nk}) and ξ(z_{n−1,j}, z_{nk}) as constants and maximizes Q(θ, θ^old) with respect to the parameters θ. The maximization of Q(θ, θ^old) with respect to π and A is done by

π_k = γ(z_{1k}) / Σ_{j=1}^{K} γ(z_{1j})

A_{jk} = Σ_{n=2}^{N} ξ(z_{n−1,j}, z_{nk}) / ( Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(z_{n−1,j}, z_{nl}) ).   (2.14)
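A small numerical sketch of the M-step updates above, assuming the posteriors γ and ξ have already been computed by the E-step; the array shapes follow the notation of this section.

```python
import numpy as np

def m_step(gamma, xi):
    """M-step of the EM algorithm for an HMM.

    gamma -- posterior state probabilities gamma(z_nk), shape (N, K)
    xi    -- posterior pair probabilities xi(z_{n-1,j}, z_nk), shape (N-1, K, K)
    Returns the updated initial distribution pi and transition matrix A.
    """
    pi = gamma[0] / gamma[0].sum()
    A = xi.sum(axis=0)                       # numerator of eq. (2.14): sum over n
    A = A / A.sum(axis=1, keepdims=True)     # normalize each row j over the states k
    return pi, A
```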

2.4.2 Forward-Backward Algorithm

There are several forward-backward algorithms; here the alpha-beta algorithm is used. First γ(z_{nk}) needs to be evaluated. According to Bayes' theorem, γ(z_n) can be expressed as

γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)

where the denominator is implicitly conditioned on θ^old henceforth. With the help of the conditional independence property and the product rule of probability, γ(z_n) can be further expressed as

γ(z_n) = p(x_1, ..., x_n, z_n) p(x_{n+1}, ..., x_N | z_n) / p(X) = α(z_n) β(z_n) / p(X)   (2.15)

with

α(z_n) = p(x_1, ..., x_n, z_n)

β(z_n) = p(x_{n+1}, ..., x_N | z_n).

The computation of α(z_n) and β(z_n) can be done recursively. With the help of conditional independence properties along with the sum and product rule, α(z_n) can be expressed in terms of α(z_{n−1}) as

α(z_n) = p(x_n | z_n) Σ_{z_{n−1}} α(z_{n−1}) p(z_n | z_{n−1}).   (2.16)

An initial condition of α(z_1) = p(z_1) p(x_1 | z_1) is needed to start the recursion. During an EM optimization the value of the likelihood function p(X) is evaluated by summing both sides of equation (2.15) over z_n and using the fact that γ(z_n) is a normalized distribution. This can be expressed as

p(X) = Σ_{z_n} α(z_n) β(z_n).

In the case where only the likelihood function is of interest, this expression can be simplified by setting n = N. This means that there is no need for a β recursion, which reduces the computational cost. The evaluation of ξ(z_{n−1}, z_n) can be derived using Bayes' theorem, the conditional independence property and the definitions of α(z_n) and β(z_n) as

ξ(z_{n−1}, z_n) = p(z_{n−1}, z_n | X) = α(z_{n−1}) p(x_n | z_n) p(z_n | z_{n−1}) β(z_n) / p(X).

Hence ξ(z_{n−1}, z_n), to be used in equation (2.14) to estimate the transition probabilities, is computable using the results of the α and β recursions.
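A minimal sketch of the α-recursion in equation (2.16); the emission likelihoods p(x_n | z_n = k) are assumed to be precomputed, and no rescaling is applied, so it is only numerically safe for short sequences.

```python
import numpy as np

def forward_alpha(emission_lik, A, pi):
    """Alpha recursion: alpha[n, k] = p(x_1, ..., x_n, z_n = k).

    emission_lik -- p(x_n | z_n = k), shape (N, K)
    A            -- transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    pi           -- initial state probabilities, shape (K,)
    """
    N, K = emission_lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emission_lik[0]                      # alpha(z_1) = p(z_1) p(x_1 | z_1)
    for n in range(1, N):
        alpha[n] = emission_lik[n] * (alpha[n - 1] @ A)  # eq. (2.16)
    return alpha

# The likelihood p(X) can then be evaluated as alpha[-1].sum().
```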


Chapter 3

Method

In this chapter the chosen methods for both VAD and CNE will be introduced along with the evaluation format and implementation specifics. This work was divided into two parts. The first part revolved around the implementation and evaluation of VADs, i.e. the problem of detecting speech presence in a signal, whereas the second part was about noise estimation, i.e. the problem of trying to characterize the background noise.

3.1 VAD Methods

For this work 3 VAD methods were implemented and evaluated; however, these methods were modified for a total of 10 variants of the original 3 methods. The 3 methods were a standard VAD, a VAD method from a literature review, and the method currently used by Konftel. The standard method chosen was the ES 202 050 standard by the European Telecommunications Standards Institute [8]. The ES 202 050 standard is a common benchmark in the comparison of VAD methods and will henceforth be known as ETSI. 2 variants of ETSI will be implemented, described in detail in Section 3.4.2. The method chosen from the literature review is the statistical model-based VAD (SMVAD) as described in [26]. As mentioned in the introduction, over the years there have been some additions to this SMVAD, but they are less suited for the real-time implementation needed in a conference telephone. This is because of the non-causality of their decision rules, introduced when future observations are used to classify the current frame. To avoid this the causal decision rule was chosen. Another prominent reason for choosing the SMVAD was its flexibility, i.e. heuristic additions can easily be added. It is a solid base which can be expanded upon depending on the needs of Konftel and time constraints. A total of 7 variants of the SMVAD will be implemented, described in detail in Section 3.4.2. The third and final method is the VAD used by Konftel as part of the Aurora audio algorithm.


3.1.1 Aurora

Aurora is the name of the audio algorithm of which the currently implemented VAD and noise estimation are a part. Henceforth Aurora will denote the VAD and noise estimation procedure only. Aurora splits the entire signal bandwidth of 16 kHz into smaller frequency bands to be processed individually. The first band includes frequencies between 0-2 kHz, the second band the frequencies 2-4 kHz, and so forth up to 16 kHz. For the VAD only the two first bands are considered, and if no speech is indicated in either of the bands the noise estimate is updated for the 0-8 kHz bands. The signal in the lowest band is passed through a high-pass filter to attenuate frequencies below 150 Hz, as noise usually has more energy than speech in the lowest part of the spectrum.

The VAD itself is based on a number of conditional statements, all of which are based on different ratio measures tracking the signal's bass and treble root mean square (RMS) variability in the near and more distant past. The bass is the lower part of the frequency band in question and the treble is the rest of that band. The VAD will indicate speech if any of the conditions are met. There is also a basic hangover scheme, activated if any of a certain subset of the conditions are met. To prevent getting stuck in a speech period there is a bypass to these conditions, based on comparing the current signal RMS to the estimated noise RMS: if the signal is below a certain ratio threshold for a specified number of frames, the decision is made to indicate a noise period. The basic idea of Aurora is quite similar to the method proposed in [20].

3.1.2 ETSI

The ETSI standard [8] proposed two different VAD schemes, where the first one is used for noise estimation and the second one is used for frame dropping. Frame dropping refers to the practice of not sending any data over voice-over-internet protocols when there is no speech present, to save on bandwidth. For this implementation only the VAD for noise estimation is of interest and therefore the VAD for frame dropping will be ignored.

The ETSI standard bases its noise update decision on a time-domain, energy-based threshold comparison. According to ETSI, for each frame m the logarithmic energy of the M last samples of the noisy speech is computed as

E_m = 0.5 + \frac{16}{\log 2}\,\log\!\left(\frac{64 + \sum_{i=1}^{M} x[i]^2}{64}\right)

where x[i] denotes the sampled noisy speech signal. The frame energy is then used to update the mean frame energy, \bar{E}_m, if the difference between the frame energy and the mean frame energy is below a threshold. For the first 10 frames this threshold check is bypassed to ensure that a mean frame energy is calculated. The mean energy is

20

Page 29: Vo ice Activity Detection and Noise Estimation for ...852787/FULLTEXT01.pdf · VOICE ACTIVITY DETECTION AND NOISE ESTIMATION FOR TELECONFERENCE PHONES Submitted in partial ful llment

updated as

\bar{E}_m = \bar{E}_{m-1} + (1 - \alpha)(E_m - \bar{E}_{m-1})

The difference between E_m and \bar{E}_m indicates speech if it is greater than a threshold. To avoid clipping of weak speech endings ETSI uses a hang-over scheme which delays the transition from speech to noise by a pre-set number of frames. The condition for activation of the hang-over timer is that the VAD has marked several consecutive frames as speech.
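A small Python sketch of a frame-energy VAD in the spirit of the scheme above is given below. The structure (log-energy computation, mean update, threshold comparison and hang-over) follows the description, but the parameter values (alpha, thresh, hang_frames, min_speech) are illustrative and not the constants of the ES 202 050 standard.

    import numpy as np

    def etsi_like_vad(frames, alpha=0.9, thresh=3.0, hang_frames=5, min_speech=3):
        """Frame-energy VAD in the spirit of the ETSI noise-estimation VAD.
        frames: iterable of sample arrays; all constants are illustrative."""
        decisions = []
        mean_e = None
        hangover = 0
        speech_run = 0
        for m, frame in enumerate(frames):
            # Logarithmic frame energy.
            e = 0.5 + 16.0 / np.log(2.0) * np.log((64.0 + np.sum(frame ** 2)) / 64.0)
            if mean_e is None:
                mean_e = e
            # Update the mean energy for the first frames, or when e is close to it.
            if m < 10 or (e - mean_e) < thresh:
                mean_e = mean_e + (1.0 - alpha) * (e - mean_e)
            is_speech = (e - mean_e) > thresh
            if is_speech:
                speech_run += 1
                if speech_run >= min_speech:
                    hangover = hang_frames   # arm the hang-over timer
            else:
                speech_run = 0
                if hangover > 0:
                    hangover -= 1
                    is_speech = True         # delay the speech-to-noise transition
            decisions.append(is_speech)
        return decisions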

3.1.3 Statistical Model-Based VAD

Before moving into the inner workings of this method, a description of the statistical model employed is in order. For this model it is assumed that the DFT coefficients of both speech and noise can be modeled as statistically independent Gaussian random variables. As both speech and noise are zero-mean processes, the mean of the DFT coefficients is also assumed to be zero, and due to the non-stationarity of each process it is also assumed that the coefficients have a time-varying variance. The central limit theorem states that the arithmetic means of different samples of independent and identically distributed random variables will have a Gaussian density [18]. Applying this to the case at hand, if the sampled sequence is statistically independent then the density of the DFT will be Gaussian, as the DFT coefficients are merely a weighted sum of sampled random variables, see equation (2.5). In the case of speech signals the central limit theorem holds when adequately separated samples are weakly dependent. The assumption of statistical independence used in this model is equal to the DFT coefficients being uncorrelated. This assumption is justified by the fact that the correlation between the DFT coefficients approaches zero as the DFT analysis frame length approaches infinity [6]. While it is here assumed that the speech process is Gaussian, there have been several investigations trying to find the PDF of speech, ending up with different distributions [11]. The absolute value of a complex Gaussian distribution is a Rice distribution, which is approximately Gaussian for high absolute values of the DFT coefficients [29].

The VAD proposed in [26] tests the following hypotheses for each processed frame.

H0 : X = N (speech absent)

H1 : X = N + S (speech present)

where S, N and X are L-dimensional DFT coefficient vectors of speech, noise and noisy speech respectively, calculated by equation (2.5). Under the model assumptions the PDFs conditioned on H0 and H1 are given by

P(X \mid H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi\lambda_N(k)}\, e^{-\frac{|X_k|^2}{\lambda_N(k)}}


P(X \mid H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi(\lambda_N(k)+\lambda_S(k))}\, e^{-\frac{|X_k|^2}{\lambda_N(k)+\lambda_S(k)}}

where λ_N(k) and λ_S(k) denote the variances of N_k and S_k for k = 0, ..., L-1. Note that under the model assumption the variance of N_k and S_k is equal to the periodogram of the PSD. The LRT, equation (2.10), for the k:th frequency bin can be defined as

\Lambda_k = \frac{p(X_k \mid H_1)}{p(X_k \mid H_0)} = \frac{1}{1+\xi_k}\, e^{\frac{\gamma_k \xi_k}{1+\xi_k}} \qquad (3.1)

where

\gamma_k = \frac{|X_k|^2}{\lambda_N(k)} \qquad (3.2)

\xi_k = \frac{\lambda_S(k)}{\lambda_N(k)} = \frac{E[|S_k|^2]}{\lambda_N(k)} = E[\gamma_k - 1] \qquad (3.3)

are called the a posteriori and a priori SNR respectively [6]. The additional definitions provided for ξ_k are of interest in later sections. The geometric mean of the likelihood ratios of the individual frequency bins is used for the decision rule. Hence the classification rule is given by

\Lambda = \frac{1}{L} \sum_{k=0}^{L-1} \log \Lambda_k \qquad (3.4)

where Λ is then compared to a threshold η to decide in favour of H0 or H1.
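Assuming the noise PSD λ_N(k) and the a priori SNR ξ_k are available (their estimation is described below), the decision statistic of equations (3.1)-(3.4) can be computed per frame as in the following Python sketch.

    import numpy as np

    def smvad_statistic(X, lambda_N, xi):
        """Log geometric mean of the per-bin likelihood ratios, equations (3.1)-(3.4).
        X: complex DFT coefficients of the noisy frame, lambda_N: noise PSD estimate,
        xi: a priori SNR estimate; all arrays of length L."""
        gamma = np.abs(X) ** 2 / lambda_N                 # a posteriori SNR, equation (3.2)
        log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)   # log of equation (3.1)
        return np.mean(log_lr)                            # equation (3.4)

    # The frame is marked as speech if smvad_statistic(...) exceeds the threshold eta.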

Hang-over scheme based on an HMM

In [26] the authors also proposed an HMM-based hang-over scheme to incorporate the correlative behaviour of speech into the model and prevent the misdetection of weak speech endings. A first-order Markov process is used for this purpose, and thus the correlative behaviour of speech is modeled by means of the transition probabilities estimated by equation (2.14). Under this model the current state depends on the previous observations as well as the current observation, and therefore the decision rule is modified as

L(n) = \frac{P(H_0)}{P(H_1)} \cdot \frac{P(z_n = H_1 \mid \mathbf{X})}{P(z_n = H_0 \mid \mathbf{X})}

where X represents the set of observations up to frame n, and P(H_0) and P(H_1) denote the steady-state probabilities obtained by A_01 P(H_0) = A_10 P(H_1).

A variable Γ(n) = P(z_n = H_1 | X)/P(z_n = H_0 | X) is introduced, representing the a posteriori probability ratio. By defining a forward variable as α_n(i) = p(z_n = H_i, X), the forward procedure can be used to solve for α_n(i) using equation (2.16). Hence a recursive formula for Γ(n) is obtained by

\Gamma(n) = \frac{\alpha_n(1)}{\alpha_n(0)} = \frac{A_{01} + A_{11}\Gamma(n-1)}{A_{00} + A_{10}\Gamma(n-1)}\,\Lambda


and the final decision statistic becomes

L(n) = \frac{P(H_0)}{P(H_1)}\,\Gamma(n) \qquad (3.5)

As before, L(n) is compared to a threshold η to determine whether the frame contains speech or not.
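The recursion for Γ(n) and the decision statistic (3.5) translate directly into a few lines of code. The sketch below uses the transition probabilities A01 = 0.2 and A10 = 0.1 quoted later in Section 3.4.2; the initial value Γ(0) and the threshold eta are illustrative assumptions.

    import math

    def hmm_hangover(log_lambda_seq, a01=0.2, a10=0.1, eta=1.0):
        """HMM-based smoothing of the frame decisions using equation (3.5).
        log_lambda_seq: sequence of decision statistics from equation (3.4)."""
        a00, a11 = 1.0 - a01, 1.0 - a10
        p_h1 = a01 / (a01 + a10)         # steady state from A01*P(H0) = A10*P(H1)
        p_h0 = 1.0 - p_h1
        gamma_prev = 1.0                 # Gamma(0), an arbitrary starting value
        decisions = []
        for log_lam in log_lambda_seq:
            lam = math.exp(log_lam)      # geometric-mean likelihood ratio
            gamma = (a01 + a11 * gamma_prev) / (a00 + a10 * gamma_prev) * lam
            L = (p_h0 / p_h1) * gamma    # decision statistic, equation (3.5)
            decisions.append(L > eta)
            gamma_prev = gamma
        return decisions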

A Posteriori and A Priori SNR estimation

As the information needed for the decision rule is not readily available it has to be estimated. For this model the noise PSD λ_N(k) is assumed to be known in advance through some noise estimation procedure. For this implementation the noise needed for the SMVAD will be estimated by a CNE introduced in the next section; this means that 4 variants of the SMVAD will be implemented and evaluated. Assuming the background noise is known, it remains to estimate ξ_k, i.e. to estimate the variance of speech λ_S(k). The observation frames are assumed to be independent. However, in this case a dependence structure is introduced as overlapping frames are used, but the independence assumption is made nevertheless. Also, it is assumed that the noise and speech variances λ_N(k) and λ_S(k) are constant in each of the observed frames. The maximum likelihood estimate of ξ_k is based on L consecutive observation frames of the noisy speech for frequency bin k. The ML estimator of λ_S(k) is the non-negative argument (as the speech variance obviously cannot be negative, whereas equation 3.3 can be negative in practice) by which the joint conditional PDF of X_k given λ_S(k) and λ_N(k) is maximized. Using the Gaussian statistical model the following likelihood function can be formed

p(X_{k,m} \mid \lambda_S(k), \lambda_N(k)) = \prod_{l=0}^{L-1} \frac{1}{\pi(\lambda_S(k)+\lambda_N(k))}\, e^{-\frac{|X_{k,m-l}|^2}{\lambda_S(k)+\lambda_N(k)}}

where X_{k,m} is the noisy speech DFT coefficient for frame m and frequency bin k. Maximizing the preceding likelihood function with respect to λ_S(k), the estimator of λ_S(k) for frame m can be expressed as

\hat{\lambda}_S(k) = \max\!\left\langle \frac{1}{L}\sum_{l=0}^{L-1} |X_{k,m-l}|^2 - \lambda_N(k),\; 0 \right\rangle

The estimator is constrained to be non-negative as the speech variance cannot be negative. With the help of this speech variance estimator and the definitions of the a priori and a posteriori SNR from equations 3.3 and 3.2, an estimator of ξ_k can be derived as

\hat{\xi}_k = \max\!\left\langle \frac{1}{L}\sum_{l=0}^{L-1} \gamma_k(m-l) - 1,\; 0 \right\rangle


In practical implementations the running average of γ_k is replaced by a recursive averaging to estimate ξ_k. Using this method, the estimator of ξ_k in the m:th frame is given by

\bar{\gamma}_k(m) = \alpha\,\bar{\gamma}_k(m-1) + (1-\alpha)\,\frac{\gamma_k(m)}{\beta}, \qquad 0 \le \alpha < 1,\ \beta \ge 1

\hat{\xi}_k(m) = \max\langle \bar{\gamma}_k(m) - 1,\, 0\rangle \qquad (3.6)

As previously, the estimator is constrained to be non-negative. The values of α and β are determined heuristically [6].
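A minimal sketch of the recursive-averaging ML estimate of equation (3.6), with illustrative values for the heuristic constants α and β:

    import numpy as np

    def ml_a_priori_snr(gamma, gamma_bar_prev, alpha=0.7, beta=1.0):
        """Recursive-averaging ML estimate of the a priori SNR, equation (3.6).
        gamma: a posteriori SNR of the current frame, gamma_bar_prev: smoothed
        a posteriori SNR of the previous frame; alpha and beta are illustrative."""
        gamma_bar = alpha * gamma_bar_prev + (1.0 - alpha) * gamma / beta
        xi = np.maximum(gamma_bar - 1.0, 0.0)   # the speech variance cannot be negative
        return xi, gamma_bar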

For a smoother estimate of ξ_k, causing fewer fluctuations of the LRT during noise-only periods [26], the decision-directed (DD) method can be used. The DD method also reduces the so-called musical noise when used as a part of a speech enhancement system [6]. This method relies on the minimum mean square error (MMSE) estimate of the short-time spectral amplitude (STSA) of speech. The MMSE estimator is obtained when the Bayes risk function from equation (2.9) is minimized with respect to the estimator |Ŝ_k| for the quadratic loss function L(|S_k|, |Ŝ_k|) = (|S_k| - |Ŝ_k|)^2 [6]. Under the model assumptions the MMSE amplitude estimate is

|\hat{S}_k| = E[\,|S_k| \mid X_k\,]

which can be expressed as

|\hat{S}_k| = \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{\nu_k}}{\gamma_k}\, e^{-\frac{\nu_k}{2}} \left[(1+\nu_k)\, I_0\!\left(\frac{\nu_k}{2}\right) + \nu_k\, I_1\!\left(\frac{\nu_k}{2}\right)\right] |X_k| \qquad (3.7)

where I_0 and I_1 denote the modified Bessel functions of zero and first order, and

\nu_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k \qquad (3.8)

Another speech amplitude estimator is the MMSE log-STSA, which is based on the MMSE estimate of the logarithmic STSA. The MMSE log-STSA estimator is obtained when the Bayes risk function from equation (2.9) is minimized with respect to the estimator |Ŝ_k| for the logarithmic loss function L(|S_k|, |Ŝ_k|) = (log|S_k| - log|Ŝ_k|)^2 [7]. Here the speech amplitude is estimated as

|\hat{S}_k| = \frac{\xi_k}{1+\xi_k}\, \exp\!\left(\frac{1}{2}\int_{\nu_k}^{\infty} \frac{e^{-t}}{t}\, dt\right) |X_k| \qquad (3.9)

where the integral in equation (3.9) can be recognised as the exponential integral of ν_k. The exponential integral can be evaluated numerically, but at some computational cost. However, there exist good approximations of the integral that reduce the computational complexity [17]. For the complete derivations of the amplitude estimators (3.7) and (3.9) the reader is directed to [6] and [7] respectively.


Using the relationship between the a priori SNR ξ_k and the a posteriori SNR γ_k together with the definition of ξ_k, both obtained from equation (3.3), the DD a priori SNR ξ_k for frame m can be expressed as

\xi_k(m) = E\!\left[\frac{1}{2}\,\frac{|S_k(m)|^2}{\lambda_N(k,m)} + \frac{1}{2}\big(\gamma_k(m) - 1\big)\right]

and the DD estimator of ξk(m) as

\hat{\xi}_k(m) = \alpha\,\frac{|\hat{S}_k(m-1)|^2}{\lambda_N(k,m-1)} + (1-\alpha)\cdot\max\langle \gamma_k(m) - 1,\, 0\rangle \qquad (3.10)

where arguments with m-1 denote the previous frame's estimates of the speech amplitude and noise PSD. The smoothing parameter α is obtained heuristically. A common starting condition for the DD estimation of ξ_k is ξ_k(0) = α + (1-α)·max〈γ_k(0)-1, 0〉 [6].
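The two amplitude estimators and the DD rule can be sketched with standard special functions; scipy provides the modified Bessel functions and the exponential integral needed in (3.7) and (3.9). The smoothing parameter alpha = 0.98 is an illustrative choice, not a value taken from this work.

    import numpy as np
    from scipy.special import i0, i1, exp1   # modified Bessel functions, exponential integral

    def mmse_stsa(xi, gamma, X):
        """MMSE short-time spectral amplitude estimate, equation (3.7)."""
        nu = xi / (1.0 + xi) * gamma
        gain = (np.sqrt(np.pi) / 2.0 * np.sqrt(nu) / gamma * np.exp(-nu / 2.0)
                * ((1.0 + nu) * i0(nu / 2.0) + nu * i1(nu / 2.0)))
        return gain * np.abs(X)

    def mmse_log_stsa(xi, gamma, X):
        """MMSE log-STSA estimate, equation (3.9), using the exponential integral."""
        nu = xi / (1.0 + xi) * gamma
        gain = xi / (1.0 + xi) * np.exp(0.5 * exp1(nu))
        return gain * np.abs(X)

    def dd_a_priori_snr(S_prev, lambda_N_prev, gamma, alpha=0.98):
        """Decision-directed a priori SNR estimate, equation (3.10); alpha is the
        heuristically chosen smoothing parameter (0.98 is an illustrative value)."""
        return (alpha * np.abs(S_prev) ** 2 / lambda_N_prev
                + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))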

3.2 CNE Methods

For the implementation and evaluation of CNE, 4 methods were chosen. These methods are the likelihood ratio approach (LRA), the SNR-dependent recursive average (SNRDRA), the improved minima-controlled recursive averaging (IMCRA) and the Rangachari noise estimator (named RNE in this paper but sometimes called MCRA-2). Recall from the introduction that there are 3 main classes of CNE: the time-recursive averaging algorithms, the minimum tracking algorithms and the histogram-based algorithms. All 4 chosen methods belong to the time-recursive averaging algorithms. The LRA method was originally suggested by the author to be used as the noise estimator for the SMVAD [27]. The SNRDRA method bases its noise update on the estimated SNR for each frequency bin, making it a very simple method [16]. There are two minimum tracking algorithms, the minimum-statistics algorithm [19] and the continuous spectral minimum tracking algorithm [5], and they are incorporated in two of the chosen time-recursive averaging algorithms respectively. These methods are IMCRA [4], which uses the minimum-statistics algorithm, and RNE [24], which uses the continuous spectral minimum tracking algorithm. It was shown in their respective papers that these latter algorithms outperform the original minimum tracking algorithms. This essentially means that only the histogram-based methods were not evaluated in any form. These 4 methods were compared amongst each other and to Aurora and ETSI in terms of noise estimation capability.

The time-recursive averaging CNE algorithms all rely on a recursive update of the noise PSD for each frequency bin. This recursion takes the form of

\lambda_N(k, m+1) = \alpha_N(k,m)\,\lambda_N(k,m) + \big(1 - \alpha_N(k,m)\big)\,|X_{k,m}|^2 \qquad (3.11)

where k represents the frequency bin and m denotes the frame. The calculation of the smoothing factor α_N is what differs between the methods.
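All four CNE methods therefore share the update of equation (3.11) and differ only in how α_N is obtained; a one-function sketch of the shared update:

    import numpy as np

    def update_noise_psd(lambda_N, X, alpha_N):
        """Time-recursive noise PSD update, equation (3.11). lambda_N: current noise
        PSD estimate per bin, X: complex DFT coefficients of the current frame,
        alpha_N: per-bin (or scalar) smoothing factor supplied by the CNE method."""
        return alpha_N * lambda_N + (1.0 - alpha_N) * np.abs(X) ** 2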


3.2.1 Likelihood Ratio Approach

This method was proposed in [27] to be used together with the SMVAD. For this noise estimation method a so-called soft-decision noise adaptation is used, which updates the noise statistic regardless of the presence of a speech or a non-speech frame. The estimator of λ_N(k,m) can, in terms of the MMSE, be expressed as

\hat{\lambda}_N(k,m) = E[\lambda_N(k,m) \mid X_{k,m}] = E[\lambda_N(k,m) \mid H_0]\,P(H_0 \mid X_{k,m}) + E[\lambda_N(k,m) \mid H_1]\,P(H_1 \mid X_{k,m}) \qquad (3.12)

By Bayes' rule the conditional speech absence and presence probabilities P(H_0 | X_{k,m}) and P(H_1 | X_{k,m}) can be further expressed as

P(H_0 \mid X_{k,m}) = \frac{1}{1 + \varepsilon\Lambda(k,m)}, \qquad P(H_1 \mid X_{k,m}) = \frac{\varepsilon\Lambda(k,m)}{1 + \varepsilon\Lambda(k,m)} \qquad (3.13)

where ε = P(H_1)/P(H_0) and Λ(k,m) is defined as in (3.1). The MMSE STSA speech amplitude estimator from equation (3.7) is used for the DD estimation in (3.10). For the estimator of equation (3.12), |X_{k,m}|^2 for the m:th frame replaces E[λ_N(k,m)|H_0], as the signal is then expected to contain noise only, and E[λ_N(k,m)|H_1] is replaced by the current noise estimate to avoid adding speech components to the noise estimate. By using the geometric mean of the LRT, equation (3.4), instead of Λ(k,m), the smoothing factor α_N to be used in equation (3.11) is computed as

\alpha_N(m) = \frac{\varepsilon\Lambda(m)}{1 + \varepsilon\Lambda(m)} \qquad (3.14)

Note that this smoothing factor does not take into account the speech presence in each individual frequency bin but rather the overall speech presence probability.
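A sketch of the resulting smoothing factor, taking Λ(m) to be the geometric mean of the per-bin likelihood ratios, i.e. the exponential of the statistic in equation (3.4), and using ε = 200 as chosen later in Section 3.4.3:

    import numpy as np

    def lra_smoothing_factor(log_lr_mean, eps=200.0):
        """Smoothing factor of the likelihood ratio approach, equation (3.14).
        log_lr_mean is the statistic of equation (3.4); its exponential is the
        geometric mean of the per-bin likelihood ratios. eps (epsilon) acts as an
        adaptation speed factor and is set to 200 in this work."""
        lam = np.exp(log_lr_mean)
        return eps * lam / (1.0 + eps * lam)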

3.2.2 Improved Minima-Controlled Recursive Averaging

This method was proposed in [4] and, although it actually relies on a VAD decision to exclude speech, it resembles a CNE more than a VAD in terms of updating the noise. As in the case of the likelihood ratio approach, the conditional speech presence probability is based on the LRT. By substituting (3.1) into the second part of (3.13) we have

P(H_1 \mid X_{k,m}) = \left(1 + \frac{\hat{P}(H_0 \mid X_{k,m})}{1 - \hat{P}(H_0 \mid X_{k,m})}\,\big(1 + \xi_k(m)\big)\, e^{-\nu_k(m)}\right)^{-1} \qquad (3.15)

where ν_k(m) is computed by (3.8) and ξ_k(m) by (3.10), using the MMSE log-STSA from (3.9). The recursive smoothing variable α_N(k,m) to be used in equation (3.11) is calculated as

\alpha_N(k,m) = \bar{\alpha}_N + (1 - \bar{\alpha}_N)\,P(H_1 \mid X_{k,m}) \qquad (3.16)


where \bar{\alpha}_N < 1 is a constant smoothing factor that is adjusted by the speech presence probability. Using equations (3.11) and (3.16) with the IMCRA method introduces a bias towards underestimating the noise. This bias is introduced as the speech presence probability is biased towards higher values to avoid speech distortion. To compensate for this, a bias factor β is included in the final noise estimator, \tilde{\lambda}_N(k,m+1) = \beta\,\hat{\lambda}_N(k,m+1).

The estimator \hat{P}(H_0 | X_{k,m}) of the a priori speech absence probability needed for (3.15) is controlled by the minima values of a recursively smoothed noisy signal PSD. The general idea behind IMCRA is to perform smoothing and minimum tracking of the PSD two times, where the first iteration makes a rough VAD decision in each frequency bin and the second iteration excludes strong speech components, making for a robust minimum tracking during speech activity. The first smoothing is done by recursively averaging the noisy speech PSD S(k,m) in time and in frequency. The smoothed noisy speech PSD is estimated as

S(k,m) = \alpha_s\, S(k,m-1) + (1-\alpha_s)\, S_f(k,m)

where \alpha_s is the smoothing factor and S_f(k,m) is the frequency-smoothed power spectrum obtained by applying a window function w(i), with \sum_{i=-L_w}^{L_w} w(i) = 1, to the noisy speech power spectrum. Hence the frequency smoothing in each frame for window length 2L_w + 1 is computed as

S_f(k,m) = \sum_{i=-L_w}^{L_w} w(i)\,|X_{k-i,m}|^2

The minimum value S_{min}(k,m) is found by choosing the minimum value of S(k,m) over the past D frames, i.e. incorporating minimum tracking. After the first smoothing operation a rough VAD decision, indicating speech absence, is made by comparing the threshold parameters γ_0 and ς_0 to

\gamma_{\min}(k,m) = \frac{|X_{k,m}|^2}{B_{\min}\, S_{\min}(k,m)}, \qquad \varsigma(k,m) = \frac{S(k,m)}{B_{\min}\, S_{\min}(k,m)} \qquad (3.17)

respectively, where B_{min} is a constant bias factor inherent to the use of minimum statistics.

The second smoothing only smooths the frequency bins that have been marked as containing no speech, here denoted by the indicator I(k,m). The two-times smoothed power spectrum is denoted \tilde{S}_f(k,m) and given by

\tilde{S}_f(k,m) = \frac{\sum_{i=-L_w}^{L_w} w(i)\, I(k-i,m)\, |X_{k-i,m}|^2}{\sum_{i=-L_w}^{L_w} w(i)\, I(k-i,m)}

if \sum_{i=-L_w}^{L_w} I(k-i,m) \neq 0, and otherwise defined to be \tilde{S}(k,m-1), which in turn is calculated as

\tilde{S}(k,m) = \alpha_s\, \tilde{S}(k,m-1) + (1-\alpha_s)\, \tilde{S}_f(k,m)


with initial condition \tilde{S}(k,0) = S_f(k,0). The minimum tracking of the noisy speech PSD, \tilde{S}_{min}, in the second iteration is defined as the minimum of \tilde{S}(k,m) over the past D frames. Finally, the a priori speech absence probability \hat{P}(H_0 | X_{k,m}) can be estimated as

\hat{P}(H_0 \mid X_{k,m}) =
\begin{cases}
1 & \text{if } \tilde{\gamma}_{\min}(k,m) \le 1 \text{ and } \tilde{\varsigma}(k,m) < \varsigma_0 \\
\frac{\gamma_1 - \tilde{\gamma}_{\min}(k,m)}{\gamma_1 - 1} & \text{if } 1 < \tilde{\gamma}_{\min}(k,m) \le \gamma_1 \text{ and } \tilde{\varsigma}(k,m) < \varsigma_0 \\
0 & \text{otherwise}
\end{cases}

where γ_1 is a threshold parameter and \tilde{\gamma}_{min}(k,m) and \tilde{\varsigma}(k,m) are defined as in equation (3.17) using the two-times smoothed PSDs instead, i.e. \tilde{S} and \tilde{S}_{min}.
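The following Python sketch collects the last steps: the a priori speech absence probability from the two-times smoothed quantities, the speech presence probability of (3.15) and the smoothing factor of (3.16). The constant values (alpha_const, Bmin, gamma1, zeta0) are illustrative placeholders rather than the exact parameters of [4].

    import numpy as np

    def imcra_smoothing_factor(X_pow, S2, S2_min, xi, nu,
                               alpha_const=0.85, Bmin=1.66, gamma1=3.0, zeta0=1.67):
        """Per-bin IMCRA smoothing factor: a priori speech absence probability from
        the two-times smoothed PSD (S2) and its minimum (S2_min), converted to the
        speech presence probability of (3.15) and the smoothing factor of (3.16).
        X_pow = |X_{k,m}|^2; xi and nu come from (3.10) and (3.8)."""
        gamma_min = X_pow / (Bmin * S2_min)          # as in (3.17), two-times smoothed
        zeta = S2 / (Bmin * S2_min)

        q = np.zeros_like(X_pow)                     # a priori speech absence probability
        low = zeta < zeta0
        q = np.where(low & (gamma_min <= 1.0), 1.0, q)
        mid = low & (gamma_min > 1.0) & (gamma_min <= gamma1)
        q = np.where(mid, (gamma1 - gamma_min) / (gamma1 - 1.0), q)

        # Conditional speech presence probability, equation (3.15).
        with np.errstate(divide="ignore", invalid="ignore"):
            p_h1 = 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-nu))
        p_h1 = np.where(q >= 1.0, 0.0, p_h1)         # q = 1 means speech surely absent

        return alpha_const + (1.0 - alpha_const) * p_h1   # equation (3.16)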

3.2.3 Rangachari Noise Estimation

This method was proposed by Rangachari et al. in [24]. First the noisy speech PSD is smoothed with a smoothing constant α_s

S(k,m) = \alpha_s\, S(k,m-1) + (1-\alpha_s)\, |X_{k,m}|^2

In contrast to the IMCRA method, the tracking of the minimum PSD is carried out by the continuous spectral minima-tracking method given by the formula

S_{\min}(k,m) = \gamma\, S_{\min}(k,m-1) + \frac{1-\gamma}{1-\beta}\,\big(S(k,m) - \beta\, S(k,m-1)\big)

if S_{min}(k,m-1) < S(k,m), and S_{min}(k,m) = S(k,m) otherwise. Here γ and β are parameter values determined experimentally. The binary speech presence decision is made by comparing the ratio of S(k,m) and S_{min}(k,m) to a frequency-dependent threshold. The speech presence probability is then smoothed over time by P(H_1|X_{k,m}) = α_p P(H_1|X_{k,m-1}) + (1-α_p) I(k,m), where I(k,m) is the binary speech presence decision. Making use of equation (3.16) with this P(H_1|X_{k,m}), the time-frequency dependent smoothing factor can be computed and used in the noise updating scheme of equation (3.11).
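A per-frame sketch of the RNE tracking and smoothing-factor computation is given below; delta stands in for the frequency-dependent threshold and all constants are illustrative.

    import numpy as np

    def rne_update(X_pow, state, alpha_s=0.7, gamma=0.998, beta=0.96,
                   delta=2.0, alpha_p=0.2, alpha_const=0.85):
        """One frame of Rangachari-style noise tracking: PSD smoothing, continuous
        spectral minima tracking and the smoothing factor of (3.16)."""
        S_prev, Smin_prev, p_prev = state["S"], state["Smin"], state["p"]

        S = alpha_s * S_prev + (1.0 - alpha_s) * X_pow           # smoothed noisy PSD
        Smin = np.where(Smin_prev < S,                           # continuous minima tracking
                        gamma * Smin_prev
                        + (1.0 - gamma) / (1.0 - beta) * (S - beta * S_prev),
                        S)
        speech = (S / Smin) > delta                              # binary presence per bin
        p = alpha_p * p_prev + (1.0 - alpha_p) * speech          # smoothed presence probability
        alpha_N = alpha_const + (1.0 - alpha_const) * p          # equation (3.16)

        state.update(S=S, Smin=Smin, p=p)
        return alpha_N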

3.2.4 SNR Dependent Recursive Averaging

Lin et al. [16] presented an SNR-dependent recursive averaging. Here the smoothing factor α_N to be used in (3.11) is a sigmoid function of the a posteriori SNR γ_k(m) and is computed as

\alpha_N = \frac{1}{1 + e^{-\beta(\gamma_k(m) - 1.5)}} \qquad (3.18)

where β is a parameter controlling the shape of the sigmoid and thus the way α_N changes as a response to γ_k(m). Larger values of β generally lead to slower noise updates and smaller values to faster updates, with an increased possibility of overestimating the noise PSD during speech periods. The estimate of γ_k(m) is calculated as the noisy speech PSD divided by the mean of the noise PSD over the 10 preceding frames.
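The smoothing factor of equation (3.18) is a one-liner; β = 0.6 is the value used in this work (see Section 3.4.3).

    import numpy as np

    def snrdra_smoothing_factor(gamma, beta=0.6):
        """SNR-dependent smoothing factor, equation (3.18); beta = 0.6 as used in
        this work. gamma is the estimated a posteriori SNR per frequency bin."""
        return 1.0 / (1.0 + np.exp(-beta * (gamma - 1.5)))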


3.3 Methods of Comparison

As mentioned earlier, the introduced methods will be compared in two parts. The VADs (ETSI with 2 variants, Aurora and the SMVAD with 7 variants) will be compared to each other using receiver operating characteristics (ROC) curves. The noise estimation schemes (i.e. the 4 CNE methods, ETSI and Aurora) will be evaluated using the mean square error (MSE) and median square error (MedSE) between the estimated and true noise. To evaluate the estimation spread of each method, the interquartile range (IQR) and the 90-th percentile (90th) of the squared error measure will be evaluated. The CNE methods will also be tested as part of the SMVAD for their performance as noise estimators for the VAD. The noise estimates will also be analyzed by looking at the comfort noise each would produce.

3.3.1 Comparison of VAD Methods

To evaluate a classifier's performance a few terms need to be explained. When dealing with a classification problem with two classes there are four possible outcomes of the classification decision for each instance, in this case each frame. Denoting the classes as positive (speech) and negative (noise), the four outcomes become true positive (TP, positive class classified as positive), true negative (TN, negative class classified as negative), false positive (FP, negative class classified as positive) and false negative (FN, positive class classified as negative).

Receiver Operating Characteristics

The ROC curve shows the true positive rate (TPR, percentage of speech frames correctly classified as speech frames) plotted against the false positive rate (FPR, percentage of noise frames falsely classified as speech frames) [10]. This means that an ideal VAD would only have the classification outcomes true positive and true negative. It also tells us that one can easily make a simple VAD with a perfect true positive rate by classifying everything as positive; however, this means that the true negative rate will be worthless. The trade-off between true positives and false positives is apparent, and the idea of a ROC curve is to visualize this. If a loss function is added, a point on the ROC curve will be deemed the best working condition, i.e. the point which minimizes the loss function.

For the creation of a ROC curve two things are needed: a predictive variable and a response variable. In the case of the SMVAD the predictive variable is obtained from equation (3.5) computed for every frame, and the response is the hand-labeled true speech and noise classification for each frame. The true classes are obtained by comparing the energy of each sample of the clean speech signal to an arbitrary threshold and marking the frames containing any sample energy greater than the threshold as speech frames. A ROC curve can be obtained by varying the threshold η in equation 3.5 for the predictive variable and seeing how it


affects the true positive and false positive rates. For this comparison the ROC curves were attained using built-in functions of the software MATLAB®. These functions fit a generalized linear model with the response and predictive variables and calculate probability estimates of belonging to a certain class for each observation, which are then used in the creation of the ROC curve.
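Independently of the MATLAB built-ins, the same kind of curve can be traced directly by sweeping the threshold over the decision statistic, as in the Python sketch below.

    import numpy as np

    def roc_points(scores, labels, n_thresholds=200):
        """Trace a ROC curve by sweeping the threshold eta over the decision statistic
        (e.g. equation (3.5)); labels hold the hand-labeled true classes (True = speech)."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=bool)
        thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
        tpr = np.array([(scores[labels] > eta).mean() for eta in thresholds])
        fpr = np.array([(scores[~labels] > eta).mean() for eta in thresholds])
        return fpr, tpr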

Combining Logical Classifiers

As previously mentioned, a VAD returns a binary decision of speech or non-speech. However, this decision could in turn be based on a sum of binary decisions, as is the case for the Aurora algorithm. It is of interest to be able to evaluate these kinds of classifiers. With two classifiers v1 and v2 in the ROC space it is interesting to know how the disjunctive classifier v3 = v1 ∨ v2 would perform. It can be shown that the TPR of v3 is bounded from below by the maximum TPR of v1 and v2 and from above by the minimum of 1 and the sum of the TPRs of v1 and v2. The FPR is bounded in the same way. Assuming independence of the original classifiers, the position of v3 can be estimated as TPR3 = 1 − (1 − TPR1 − TPR2 + TPR1 · TPR2) = 1 − (1 − TPR1)(1 − TPR2), and FPR3 is estimated in the same fashion [10]. Hence, having a binary classifier as the disjunction of two or more binary classifiers could theoretically yield a much better result than either of the sole classifiers.
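Under the independence assumption, the estimated ROC position of the disjunction can be computed as follows.

    def combine_disjunction(tpr1, fpr1, tpr2, fpr2):
        """Estimated ROC position of the disjunction of two binary classifiers,
        assuming they are independent [10]."""
        tpr3 = 1.0 - (1.0 - tpr1) * (1.0 - tpr2)
        fpr3 = 1.0 - (1.0 - fpr1) * (1.0 - fpr2)
        return tpr3, fpr3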

3.3.2 Comparison of Noise Estimation

The CNE methods will be compared amongst each other and to the Aurora and ETSI algorithms by the MSE, MedSE, IQR and the 90-th percentile of the squared errors. These measures will give an idea of how the different methods estimate the noise PSD differently.

Squared Error Measures

The MSE is calculated as [17]

\mathrm{MSE} = \frac{1}{N} \sum_{m=1}^{N} \frac{\sum_k \big(\lambda_N(k,m) - \hat{\lambda}_N(k,m)\big)^2}{\sum_k \lambda_N(k,m)^2}

where N is the total number of frames and λ_N(k,m) and \hat{λ}_N(k,m) denote the true and estimated noise PSD for frequency bin k and frame m. The true noise PSD is defined as the periodogram of the noise signal, i.e. equation (2.7). The MedSE is calculated using the median instead of the mean. The reasoning for using the MedSE as well is that it is not sensitive to outliers, whereas the MSE is. The use of a normalized squared error measure helps to avoid the effects of varying signal strength caused by non-stationary signals and different SNR levels.
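A sketch of the per-frame normalized squared errors from which the MSE, MedSE and percentile measures are derived:

    import numpy as np

    def normalized_squared_errors(true_psd, est_psd):
        """Per-frame normalized squared errors between the true and estimated noise
        PSD; both arrays have shape (frames, bins). The mean of the returned vector
        is the MSE of the text, its median the MedSE, and np.percentile(err, 90)
        gives the 90-th percentile measure."""
        err = (np.sum((true_psd - est_psd) ** 2, axis=1)
               / np.sum(true_psd ** 2, axis=1))
        return err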


Squared Error Variability

The IQR value is the difference between the third and first quartiles of the squared error values and is a measure of the variability of the estimator. The 90-th percentile squared error value indicates how the method overestimates the noise, which can be more troublesome than underestimating it. The MSE is a measure heavily affected by outliers, i.e. overestimations in practice. This is because underestimation errors can never yield a squared error greater than 1, as the noise estimate is never negative, whereas an overestimation is only bounded by the ratio of the noisy speech signal and the noise signal and can therefore take values greater than 1. Hence, overestimations due to misclassification of speech will have a greater effect on the squared error measures at high SNR.

Statistical Testing

Differences in MedSE and the 90-th percentile will be statistically tested at a 5% significance level for each method pair. The MedSE will be tested using Wilcoxon's rank sum test. The 90-th percentile will be tested with the quantile inference test developed in [15]. As overlapping frames were used, two consecutive measurements will be dependent, which means these tests are not reliable. However, to counter this and allow the independence assumption to be made, measurements are taken on every fourth frame instead.

Noise Estimation for SMVAD

The CNE methods will also be evaluated as part of the SMVAD. Each of the four methods will be employed as the noise estimation procedure used for updating the noise for the SMVAD. This will give further information as to how the noise is estimated and how useful the technique is.

Comfort Noise

As mentioned earlier, the noise estimate is not only used for noise suppression but also for generating comfort noise. To see how these noise estimates would generate different comfort noise, the estimated noise PSD will be converted back into time-domain noise. The generated comfort noises will be quite informally evaluated to further understand the estimation characteristics of the methods under investigation.


3.4 Implementation

3.4.1 Test Signals

All methods for VAD or noise estimation were tested using a signal consisting of 40 sentences spoken by 10 males and 10 females in 5 different languages. These sentences were concatenated from 20 sound files, each containing 2 sentences. The structure of every speech file is the same. Each speech file starts off with about 1 s of silence followed by a 2 s spoken sentence, then 2 s of silence followed by another 2 s sentence, and ends with 1 s of silence, making for a speech file of roughly 8 s in total. The spoken languages were American English, German, Spanish, Dutch and Chinese. The speech signal was embedded in additive noise at different SNR levels. The noise sources were recordings from a cafeteria (highly non-stationary noise), from a street (non-stationary noise) and computer-generated white Gaussian noise (stationary noise). The cafeteria noise contains soft babble noises and the occasional sound of cutlery in use. The street noise is low-frequency traffic noise which includes passing cars. The Gaussian noise signal was generated by using numbers obtained from a Gaussian distribution. The SNR levels used were 0, 6, 12 and 18 dB, as well as a strongly varying SNR level that shifts between 6 and 18 dB. This strongly varying SNR level was used to test the VADs' ability to detect speech in a very unstable environment. Both the speech and the noise signals are part of the standard ITU-T P.501 sound files.

For each two-sentence sound file a randomly selected and equally long segment of the noise sound file was chosen and added to the speech file at a specified SNR level. The noisy sentences were then concatenated to make the 40-sentence-long signal. This means that between each segment there will be a change in noise. The VADs and noise estimation schemes were tested using the same setup. Figure 3.1 shows three different four-sentence-long example signals and the basic characteristics of the different noise sources.

All signals were analysed using a Hamming window on 32 ms long time frames sampled at 16 kHz with 50% overlap. The signals for all methods were also passed through a highpass filter, as described in Section 3.1.1, to make for a more equal comparison. A Butterworth filter of fourth order with a cutoff frequency of 150 Hz was used for this purpose.
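For reference, a Python sketch of this preprocessing (fourth-order Butterworth highpass at 150 Hz, 32 ms Hamming-windowed frames at 16 kHz with 50% overlap) using scipy; it is an illustration of the setup rather than the MATLAB code actually used.

    import numpy as np
    from scipy.signal import butter, lfilter

    def preprocess_and_frame(x, fs=16000, frame_ms=32, overlap=0.5, cutoff_hz=150):
        """Highpass filtering and Hamming-windowed framing as described above.
        Returns the DFT coefficients of each frame."""
        b, a = butter(4, cutoff_hz / (fs / 2.0), btype="highpass")
        x = lfilter(b, a, x)

        frame_len = int(fs * frame_ms / 1000)           # 512 samples at 16 kHz
        hop = int(frame_len * (1.0 - overlap))          # 256 samples, 50% overlap
        window = np.hamming(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([window * x[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        return np.fft.rfft(frames, axis=1)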

3.4.2 Implementation of VAD Comparison

The SMVAD presented in [26] was implemented as described therein, keeping all parameter values as specified in the paper unless otherwise stated. The transition probabilities were set to A01 = 0.2 and A10 = 0.1 by the author. These transition probabilities were replaced by estimates based on the 40-sentence sample using equation (2.14).



Figure 3.1: Example signals of speech corrupted by different noise sources at various SNR. The sentences in (a) are spoken by a German and a Spanish male in 18 dB SNR white noise. In (b) the speech is from a German female and a German male embedded in 12 dB SNR street noise, and in (c) a Dutch female and a Chinese male speak in 6 dB SNR cafeteria noise. The speech is shown in red and the noise in blue.


As both the ETSI standard and the Aurora algorithm only return binary decisions for speech detection, obtaining a ROC curve is difficult, i.e. you only obtain a point. There are several parameter values for both algorithms that have been acquired empirically, which makes it difficult to find a good starting point for the creation of a ROC curve. However, the ETSI standard has been slightly modified to allow a ROC curve to be created. The modification is done by storing the energy difference measure used for the VAD decision and using it as a predictive variable to create a ROC curve.

The Aurora algorithm was converted from its C implementation into MATLAB® by studying the source code provided by Konftel. This was done to allow for simpler testing. All parameter values have been kept as specified by Konftel. The ETSI algorithm was implemented as described in [8]. Keep in mind that only the part involving the VAD for noise estimation was implemented, not the entire ETSI standard.

In the Aurora algorithm there are many parameter values that have been set empirically and specifically tuned to work in the device it was designed for. This means that it is sensitive to signal strength, and it is therefore essential to scale the signal correctly to obtain comparable results. This is not an issue for the SMVAD. The signal scale was obtained empirically for best performance in cafeteria noise at 18 dB SNR. The same procedure was carried out for ETSI as it also requires correct signal strength for best performance. The signal scale used for the Aurora algorithm was set to 3.369 · 10^{-4} and the signal scale for ETSI was set to 700. The comfort noise was generated using an implementation provided by Konftel.

ROC Curves

A total of eight ROC curves and two operating points (OP) were created for each SNR level and noise source. As mentioned, the SMVAD uses a CNE to estimate the noise, which means that it will be evaluated with each of the four CNE methods, for a total of four ROC curves. Both the SMVAD with IMCRA and with LRA have an additional ROC curve where the ML a priori SNR estimation is used instead of the DD a priori SNR estimation. An additional ROC curve was added for the SMVAD with IMCRA by using the ML estimation and an additional predictive variable based on the LRT for the first 8 frequency bins (62.5-500 Hz), named IMCRA ML+. This addition is based on the fact that roughly 80% of the speech power is located between 100-500 Hz [9]. This ROC curve was added to show that further heuristic improvements can easily be made. These are the 7 variants of the SMVAD. Furthermore, the modified ETSI scheme has a ROC curve as well as an operating point (2 variants), whereas Aurora only has an operating point. Note that the figures presenting the ROC curves show an FPR from 0 to 1 and a TPR from 0.8 to 1, as it is important to achieve a high TPR to be effective. Unless otherwise specified, IMCRA refers to the family of classifiers including the three IMCRA-based methods. The same goes for the two LRA methods.


3.4.3 Implementation of Noise Estimation Comparison

The CNE methods are implemented as suggested in their respective papers, with the exception that for the RNE method different values of the frequency-dependent threshold were used and for the SNRDRA method β in equation (3.18) was set to 0.6, both as suggested in [17]. In [27] the variable ε is defined to be the ratio between the stationary probabilities of speech and noise, but the authors also argued that it can be viewed as an adaptation speed factor. This parameter was set to 200 after some rough empirical testing trying to optimize the performance of the VAD in street and cafeteria noise at high SNR.

To make the comparison more equal, ETSI employs the same recursive noise update scheme as Aurora. For every frame m marked as noise, the noise for frequency bin k is estimated as

\lambda_N(k,m+1) = \alpha_N\,\lambda_N(k,m) + (1-\alpha_N)\,|X_{k,m}|^2

where α_N is a constant smoothing factor set to 0.85 and X_{k,m} are the signal DFT coefficients. If the frame is marked as speech, the previous frame's noise estimate is used.

Wilcoxon's rank sum test was performed in MATLAB®. The quantile inference was performed using the code discussed by the author in [15] and available on the author's website.


Chapter 4

Results

This chapter includes the results from the implementations of the presented methods for VAD and noise estimation. The informal visual inspection of the comfort noise that would be generated by the different noise estimators is also presented.

4.1 Voice Activity Detection

4.1.1 Cafeteria Noise

As expected, the highest SNR level yields the best VAD results for all methods, and both ETSI's and Aurora's operating points change with the noise strength, yielding a higher FPR while having roughly the same TPR, see Figure 4.1. IMCRA performs best across the tested noise levels and IMCRA ML+ is the best of the three. At the highly varying SNR level the heuristic addition has an adverse effect on classification. Also, the a priori SNR estimation method DD does not improve the ability to detect speech; on the contrary, it has a negative effect. It is very noticeable how RNE's relative performance increases as the SNR decreases, where it moves from being among the worst noise estimators for the SMVAD to the second best. SNRDRA does fairly well across the SNR levels and shows no big differences between the noise conditions. Both LRA methods show the worst classifying ability except for in high SNR, where they are comparable to every method except IMCRA. For LRA the DD estimation performs better in every condition. ETSI shows a similar performance compared to the other methods for every SNR level. Overall it places itself in the middle of the pack performance-wise. Also, take note of the great disparity between ETSI OP and ETSI.

ETSI OP shows better classifying ability than Aurora OP, and both show performance similar to IMCRA except for the highly varying SNR level, where IMCRA clearly performs better. For the 12 dB case the best performance of the operating points needs to be determined by a loss function as they are not horizontally or vertically aligned. While performing well in high SNR, both of the operating


points show very high TPR and FPR when the SNR is low, meaning that nearly every analyzed frame has been marked as containing speech.


Figure 4.1: ROC curves displaying the speech classification ability of the SMVAD utilizing different CNE (IMCRA, LRA, RNE and SNRDRA) and a priori SNR estimator (ML or DD) and heuristic addition (+), as well as ETSI and Aurora, in cafeteria noise at 18 dB (a), 12 dB (b), 6 dB (c), 0 dB (d) and varying between 6 and 18 dB (e) SNR. OP stands for operating point.


4.1.2 Street Noise

It is clear that the different noise sources have an impact on the classifiers. The performance is overall better in street noise than in cafeteria noise, which can be seen in Figure 4.2. This is especially prominent for ETSI OP and Aurora, as they are, for the same SNR, able to detect speech much more reliably in street noise than in cafeteria noise. As a reminder, Figure 3.1, showing the example signals, makes it quite obvious that the cafeteria noise is more non-stationary than the street noise. Once again the overall best performance is obtained by utilizing the IMCRA ML+ scheme, followed by the other two IMCRA implementations. However, in the highly varying environment IMCRA ML+ again shows worse results than the original IMCRA. As before, the DD a priori SNR estimation methods seem to have a negative effect on speech detection. RNE does not improve its relative performance to IMCRA for low SNR as in the cafeteria noise case, but rather seems to have a similar relationship to the other methods across the SNR levels. At the higher SNR levels it is able to obtain quite similar results to IMCRA DD. SNRDRA seems to be performing quite well for the high and the highly varying SNR levels but seems to degrade at the lowest SNR. LRA performs worst, although the DD estimate improves its detection ability. LRA is once again the method most affected by low SNR. ETSI does quite well for high SNR, especially for low FPR.

The two operating points show a performance similar to the cafeteria noise environment, with the notable exception that they are able to detect speech much more reliably at the two lowest SNR levels. However, it would appear that their performance relative to IMCRA has somewhat decreased, except for in the highly varying noise situation. Overall, ETSI OP performs better than Aurora OP.


Figure 4.2: ROC curves displaying the speech classification ability of the SMVAD utilizing different CNE (IMCRA, LRA, RNE and SNRDRA) and a priori SNR estimator (ML or DD) and heuristic addition (+), as well as ETSI and Aurora, in street noise at 18 dB (a), 12 dB (b), 6 dB (c), 0 dB (d) and varying between 6 and 18 dB (e) SNR. OP stands for operating point.


4.1.3 White Noise

Not surprisingly, all methods perform best in high SNR white noise compared to the other noise sources. As usual the IMCRA methods work best, but in comparison to the previous noise environments the heuristic addition shows slightly worse results, see Figure 4.3. Unlike previously, the disparity between IMCRA ML and DD is much greater. RNE does well and even performs better than IMCRA DD for the lowest SNR. The relationship between LRA and SNRDRA seems to be somewhat inconclusive. At high SNR they perform similarly to each other, but at lower SNR the relationship changes from SNRDRA being better to it being worse, indicating volatile estimators. ETSI has significant trouble with white noise and performs worst overall. Notable for most methods is the vertical rise of the ROC curve in the highly varying SNR.

In stark contrast to the previous noise cases, the white noise treats ETSI OP and Aurora very differently. In the two best noise conditions their performance is similar to the previous noise cases; however, moving to lower SNR makes ETSI OP detect very little of the speech, and at 0 dB it does not even show up in the displayed ROC space. For reference, it scored a TPR and FPR of approximately 63% and 2%. Aurora, on the other hand, shows a very stable and good performance.


Figure 4.3: ROC curves displaying the speech classification ability of the SMVAD utilizing different CNE (IMCRA, LRA, RNE and SNRDRA) and a priori SNR estimator (ML or DD) and heuristic addition (+), as well as ETSI and Aurora, in white noise at 18 dB (a), 12 dB (b), 6 dB (c), 0 dB (d) and varying between 6 and 18 dB (e) SNR. OP stands for operating point.


4.2 Noise Estimation

4.2.1 Cafeteria Noise

The first thing to take note of is that no single algorithm outperforms every other in every situation. It is therefore clear that the best method is obtained through weighing the performance characteristics of each method against each other. Secondly, the overall performance seems to increase as the SNR level falls, as can be seen in Table 4.1. SNRDRA achieves the lowest MedSE at all the different SNR levels; however, this comes with the overall second highest IQR, 90-th percentile and MSE when compared to the other methods. For high SNR the aforementioned method has the highest MSE value, and at low SNR LRA has the highest MSE, which points towards these methods suffering from big outliers. The four other methods show similar MSE values across the SNR spectrum.

The performance of IMCRA and RNE seems to be comparable overall. At high SNR IMCRA outperforms RNE, but as the SNR decreases so does the advantage over RNE. At 12 dB no significant difference between the two methods can be found, and at 6 dB and below RNE outperforms IMCRA in terms of variability of the estimates. Take note of the fact that only IMCRA performs more or less equally at all SNR levels, i.e. the measurements change relatively less for IMCRA compared to all the other methods when the SNR level decreases. SNRDRA seems to achieve its best performance for the highly varying SNR, and both IMCRA and RNE do well in this setting compared to the VAD-based estimators. LRA has the highest IQR for every SNR level and the highest 90-th percentile. LRA performs worst of the 6 methods being compared, although it manages to achieve a similar MedSE to Aurora at 6-18 dB and a better one at 0 dB.

Aurora seems to be more affected by different SNR levels than the other methods, and the best performance is obtained at 12 dB SNR. Also, at 12 dB a smaller MedSE value accompanied by a slightly higher variation is observed when compared to IMCRA and RNE. However, Aurora has trouble at 0 dB and 6 dB, where the VAD indicated all frames as speech (see Figure 4.1); hence the noise was never updated after the initial noise estimate. For these cases the estimation variability is very low. This issue is probably caused by the signal scaling being set to perform well in high SNR, but also by the highly non-stationary cafeteria noise. The problem of highly non-stationary noise can again be observed in the 6-18 dB environment.

ETSI seems to perform well and better than Aurora: at the highest SNR level it achieves the second best MedSE and a slightly higher variability compared to IMCRA and RNE. The same holds for 12 dB. ETSI also has trouble with 0 and 6 dB SNR as it marks most frames as speech, which stops the noise estimation and shows a very low variability for these two cases. Again, this is most likely due to the signal scaling and the nature of the noise. ETSI cannot handle the highly varying environment well either.


Table 4.1: Noise estimation comparison measurements for the speech embedded in cafeteria noise at different SNR levels. The superscripted numbers indicate no difference between the numbered methods in median or 90-th percentile, tested at a 5% significance level. The method numbers are 1-6 in descending order. The median was tested using Wilcoxon's rank sum test and the 90-perc with the quantile inference test discussed in Section 3.3.2. The two best measurements for each SNR level are highlighted.

Cafeteria noise  Method        MedSE      IQR     90-perc      MSE
18 dB            IMCRA (1)     0.658      0.339   1.105        0.923
                 RNE (2)       0.692      0.394   2.107        1.684
                 SNRDRA (3)    0.515      0.656   6.486        70.953
                 LRA (4)       0.827      2.225   17.344       20.280
                 Aurora (5)    0.722      1.507   8.338        5.223
                 ETSI (6)      0.587      0.527   2.722        1.788
12 dB            IMCRA         0.656^2    0.351   1.223^2,6    0.936
                 RNE           0.659^1    0.339   1.120^1,6    0.915
                 SNRDRA        0.478      0.655   6.680        23.414
                 LRA           0.846      2.672   20.355       37.370
                 Aurora        0.625      0.415   1.798        1.263
                 ETSI          0.598      0.374   1.146^1,2    0.934
6 dB             IMCRA         0.667^2    0.337   1.124        1.016
                 RNE           0.673^1    0.310   0.972        0.883
                 SNRDRA        0.478      0.634   4.961        15.774
                 LRA           0.767      1.457   9.885        10.061
                 Aurora        0.744      0.263   0.934^6      0.784
                 ETSI          0.721      0.292   0.931^5      0.767
0 dB             IMCRA         0.660^2    0.327   0.995        0.848
                 RNE           0.669^1    0.308   0.946        0.777
                 SNRDRA        0.413      0.490   1.432        1.843
                 LRA           0.710      0.831   4.470        3.250
                 Aurora        0.750      0.249   1.083        0.948
                 ETSI          0.708      0.281   0.919        0.731
6-18 dB          IMCRA         0.740      0.412   1.080        1.710
                 RNE           0.744      0.401   1.500^6      2.150
                 SNRDRA        0.534      0.641   4.895        24.117
                 LRA           0.825^5    0.726   6.455        9.936
                 Aurora        0.862^4    0.320   3.286        3.219
                 ETSI          0.816      0.414   1.318^2      1.504


4.2.2 Street Noise

The first thing to take note of here is that there are more insignificant differences when comparing IMCRA and RNE to Aurora, indicating that the relative performance of Aurora has increased, which can be seen in Table 4.2.

Once again SNRDRA achieves the smallest MedSE values across all SNR levels but also the overall second highest IQR, 90-th percentile and MSE. LRA's performance is the worst in terms of both MedSE and variability. As in the cafeteria noise case, IMCRA and RNE show overall comparable performance, where once again the former does better in high SNR and the latter in low SNR. At the highest SNR level IMCRA outperforms RNE and Aurora, but as the SNR decreases RNE performs better than IMCRA, while Aurora shows equal MedSE and higher variability. At 12 dB SNR the three algorithms perform equally well in terms of MedSE, although IMCRA has the lowest variability of the three, followed by RNE.

At 6 dB SNR there is no significant difference between RNE and Aurora, as well as no difference between Aurora and IMCRA, in terms of MedSE. However, there are differences in variability, where both IMCRA and RNE show lower spread than Aurora. Note that IMCRA seems to have the most stable performance over the span of the SNR levels. All methods except for IMCRA perform considerably worse at the highest SNR level, Aurora and RNE especially so when compared to their performance at 12 dB.

For low SNR cafeteria noise both ETSI and Aurora indicated most frames as speech, but in street noise they can update the noise much more reliably. Overall, ETSI still obtains the second best MedSE but a higher variability compared to IMCRA and RNE, yet lower than SNRDRA, positioning the method in between these three in terms of estimation characteristics. Again, it is quite noticeable that when ETSI and Aurora are not able to update the noise they achieve the lowest variability.


Table 4.2: Noise estimation comparison measurements for the speech embedded in street noise at different SNR levels. The superscripted numbers indicate no difference between the numbered methods in median or 90-th percentile, tested at a 5% significance level. The method numbers are 1-6 in descending order. The median was tested using Wilcoxon's rank sum test and the 90-perc with the quantile inference test discussed in Section 3.3.2. The two best measurements for each SNR level are highlighted.

Street noise     Method        MedSE       IQR      90-perc      MSE
18 dB            IMCRA (1)     0.700^6     0.419    1.830        1.371
                 RNE (2)       0.754       1.238    12.192^3     18.595
                 SNRDRA (3)    0.436       0.728    10.283^2     134.689
                 LRA (4)       0.906^5     8.584    98.174^5     261.260
                 Aurora (5)    1.037^4     13.550   112.287^4    72.626
                 ETSI (6)      0.646^1     1.051    7.499        10.281
12 dB            IMCRA         0.701^2,5   0.417    1.576        1.331
                 RNE           0.692^1,5   0.415    2.493        4.158
                 SNRDRA        0.398       0.539    4.864        71.309
                 LRA           0.811       3.406    40.192       48.262
                 Aurora        0.663^1,2   0.766    7.269        5.734
                 ETSI          0.598       0.436    1.778        1.845
6 dB             IMCRA         0.704^5     0.442    1.501^2      1.188
                 RNE           0.678^5     0.417    1.336^1      1.052
                 SNRDRA        0.371       0.543    5.680        21.230
                 LRA           0.817       2.796    26.208       31.982
                 Aurora        0.685^1,2   0.494    1.966        1.434
                 ETSI          0.657       0.489    1.190        1.001
0 dB             IMCRA         0.703       0.441    1.442        1.268
                 RNE           0.666       0.407    1.082        1.198
                 SNRDRA        0.353       0.441    2.979        6.178
                 LRA           0.695^5,6   1.439    9.921        10.553
                 Aurora        0.774^4,6   0.353    0.995        1.204
                 ETSI          0.801^4,5   0.352    0.977        0.825
6-18 dB          IMCRA         0.749       0.420    1.326        6.097
                 RNE           0.751       0.460    4.895        18.419
                 SNRDRA        0.411       0.632    5.281        44.179
                 LRA           0.856       2.388    19.369       49.335
                 Aurora        0.764       1.278    32.540       94.750
                 ETSI          0.647       0.558    3.965        6.727


4.2.3 White Noise

Not surprisingly, the methods achieve the lowest measurement values for the stationary white noise when compared to the previous two noise cases, see Table 4.3. As before, the smallest MedSE is achieved by SNRDRA, as well as the overall second-worst variability measures. Even so, at 0 dB and the varying SNR level it performs the best, barring a slightly higher IQR and MSE compared to IMCRA and RNE. LRA performs worst in all measurements.

The same relationship between IMCRA and RNE as previously discussed can be observed. At high SNR IMCRA performs better, but the discrepancy between the two algorithms diminishes with the SNR level, where at 0 dB there is no significant difference between the two. Aurora achieves a smaller MedSE compared to IMCRA and RNE at 12 dB but a higher variability; however, Aurora is surpassed by the two at every other SNR level. Aurora has trouble at low SNR, marking most frames as speech.

ETSI's performance compared to the other methods is similar to the previous noise case, where it places itself among IMCRA, RNE and SNRDRA. However, at the highly varying SNR level it obtains the highest MedSE. Overall, ETSI and Aurora seem to be comparable, where for higher SNR ETSI performs better while Aurora achieves a smaller variability for lower SNR, albeit still with a slightly higher MedSE.


Table 4.3: Noise estimation comparison measurements for the speech embedded in white noise at different SNR levels. The superscripted numbers indicate no difference between the numbered methods in median or 90-th percentile, tested at a 5% significance level. The method numbers are 1-6 in descending order. The median was tested using Wilcoxon's rank sum test and the 90-perc with the quantile inference test discussed in Section 3.3.2. The two best measurements for each SNR level are highlighted.

White noise      Method        MedSE       IQR      90-perc        MSE
18 dB            IMCRA (1)     0.523       0.074    0.600          0.527
                 RNE (2)       0.563       0.224    1.491          1.097
                 SNRDRA (3)    0.454       0.182    0.886          2.675
                 LRA (4)       0.827       70.832   1031.721       895.139
                 Aurora (5)    0.633       1.468    5.784          2.842
                 ETSI (6)      0.510       0.098    0.659          0.583
12 dB            IMCRA         0.520       0.072    0.592          0.522
                 RNE           0.535       0.088    0.646          0.561
                 SNRDRA        0.441       0.140    0.704          1.372
                 LRA           0.654       19.685   166.798        133.095
                 Aurora        0.522       0.128    0.758          0.613
                 ETSI          0.501       0.081    0.588          0.508
6 dB             IMCRA         0.519^5     0.072    0.588          0.521
                 RNE           0.521       0.072    0.594^5        0.524
                 SNRDRA        0.439       0.122    0.631^6        0.635
                 LRA           0.608       3.966    29.560         18.022
                 Aurora        0.517^1     0.080    0.598^2        0.521
                 ETSI          0.505       0.088    0.612^3        0.530
0 dB             IMCRA         0.516^2,6   0.071    0.586^2,3      0.518
                 RNE           0.516^1,6   0.068    0.586^1,3      0.518
                 SNRDRA        0.430       0.114    0.589^1,2,5    0.575
                 LRA           0.577       0.705    4.509          2.459
                 Aurora        0.522       0.073    0.599^3        0.530
                 ETSI          0.511^1,2   0.104    0.691          0.572
6-18 dB          IMCRA         0.531       0.092    0.916          0.767
                 RNE           0.552       0.149    0.939          1.017
                 SNRDRA        0.445       0.156    0.835          0.932
                 LRA           0.889       9.704    165.123        521.530
                 Aurora        0.554       0.337    3.060          1.683
                 ETSI          0.923       0.434    0.965          0.782


4.3 Comfort Noise

This informal comfort noise inspection is mostly used as a visual aid but does show the basic estimation characteristics of the estimators in the absence and presence of speech. Figure 4.4 shows the generated comfort noise for the different noise estimators. ETSI was not evaluated for this. It is quite obvious when inspecting the figure that the noise estimators do in fact deal with speech presence differently. RNE and IMCRA show the smallest noise estimate increase when speech is present, but the former is nonetheless more affected than the latter. The LRA method heavily underestimates the noise early on due to the choice of adaptation parameter, but after half the segment it stabilizes and it is obvious that speech presence increases the noise estimate. SNRDRA shows the most rapid increase in the noise estimate during speech and is especially affected by long speech periods. However, it quickly adapts to the noise when the speech is gone. This is not the case for LRA, for which the noise estimate decays slowly at the offset of speech. Of all methods, Aurora seems to track the noise best when no speech is present, but it is still affected by speech as the noise estimate increases with the onset of speech.


Figure 4.4: Comfort noise generated by using the different noise estimation methods in the cafeteria environment at 18 dB SNR. The estimated noise is shown in red, the speech in blue. Panels: (a) RNE, (b) IMCRA, (c) LRA, (d) SNRDRA, (e) Aurora.


Chapter 5

Discussion and Conclusion

Within this chapter the various implementations will be evaluated and discussed based on the results presented in the previous chapter. This work will also be critiqued herein. Furthermore, conclusions will be drawn and further work suggested.

5.1 VAD Evaluation

This section presents the evaluation of the VAD methods.

5.1.1 Statistical Model-Based VAD

SMVAD with IMCRA

When comparing all the noise estimation techniques used together with the SMVAD it is clear that the overall best performance is attained by employing the IMCRA estimation, where the heuristic addition works best in most conditions. The good performance of IMCRA is credited to the use of minimum tracking, which assures that the noise will not be overestimated if a long enough look-back window is used. If the noise is not overestimated it becomes easier for the SMVAD to distinguish between speech and noise. The window used to find the minimum statistic was set by the author to 120 frames, which is roughly equal to 2 s. This of course means that if there is a sudden noise increase the noise estimate would not increase for 2 s, which can be considered too slow. This kind of minimum tracking is very heuristic in nature, and for it to work well a long enough search window must be allowed, but this comes at the expense of adaptation speed in non-stationary noise.
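To make the trade-off concrete, a minimal sketch (here in Python) of minimum tracking over a fixed look-back window is given below. The 120-frame window comes from the discussion above, while the smoothing constant and the data layout are illustrative assumptions rather than the settings used in this work, and the sketch omits the bias compensation and speech-presence handling of the full IMCRA recursion.

```python
import numpy as np

def windowed_minimum_noise(psd_frames, D=120, alpha=0.8):
    """Toy minimum-statistics noise tracker over a look-back window.

    psd_frames: array of shape (n_frames, n_bins) with noisy-speech PSD values.
    D: look-back window length in frames (120 frames is roughly 2 s here).
    alpha: smoothing constant for the recursive PSD average (assumed value).
    """
    n_frames, _ = psd_frames.shape
    smoothed = np.empty_like(psd_frames)
    noise_est = np.empty_like(psd_frames)
    for t in range(n_frames):
        prev = smoothed[t - 1] if t > 0 else psd_frames[0]
        smoothed[t] = alpha * prev + (1 - alpha) * psd_frames[t]
        start = max(0, t - D + 1)
        # The estimate cannot react to a sudden noise increase until the old,
        # lower minima have left the D-frame window, i.e. a delay of up to ~2 s.
        noise_est[t] = smoothed[start:t + 1].min(axis=0)
    return noise_est
```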

SMVAD with RNE

RNE performed well as the noise estimation procedure in the SMVAD, placing second overall. This method uses continuous spectral minimum tracking, so where IMCRA relies on a look-back window RNE does not, and it is therefore able to adapt to noise changes more quickly. A drawback of this is a harder time with speech presence than IMCRA, i.e. for speech segments it includes more of the speech in the noise estimate, making it harder for the SMVAD to differentiate between noise and speech. Here lies a notable difference between IMCRA and RNE.
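For comparison with the windowed approach above, a continuous minimum-tracking update in the style of Doblinger [5], on which RNE [24] builds, can be sketched as follows; the constants gamma and beta are placeholder values, not those of the evaluated implementation.

```python
import numpy as np

def continuous_minimum_update(p, p_prev, p_min_prev, gamma=0.998, beta=0.96):
    """One frame of continuous spectral minimum tracking (element-wise per bin).

    p, p_prev: smoothed noisy-speech power in the current and previous frame.
    p_min_prev: running spectral minimum from the previous frame.
    gamma, beta: tracking constants (placeholder values).
    """
    rising = p_min_prev < p
    # When the power rises, let the minimum drift upwards slowly instead of
    # waiting for a look-back window to expire; when it falls, follow at once.
    return np.where(
        rising,
        gamma * p_min_prev + ((1 - gamma) / (1 - beta)) * (p - beta * p_prev),
        p,
    )
```

Because the minimum is updated every frame, the estimate follows noise changes without a 2 s delay, but during long speech segments the drifting minimum can absorb part of the speech energy, which is the behaviour discussed above.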

SMVAD with LRA

The LRA method performed worst across every test. The problem with this method seems to be that in its original form, where ε would be the ratio of stationary probabilities, it incorporates too much of the speech spectrum into the noise estimation. To mitigate this a heuristic value of ε was used to slow down the adaptation speed. However, it is clear that this method, and the heuristic attempt to improve it, have yielded the worst results compared to the other methods. Another problem with this method is that the noise update variable is not frequency dependent but rather the same for the entire spectrum. This adversely affects the noise updates, and it is reasonable to think that basing the update factor on each frequency bin instead would yield better results.
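One hypothetical way to realize such a per-bin update, which is not part of the LRA implementation evaluated here, would be to let the adaptation factor depend on a per-bin speech presence probability; the function name and the maximum adaptation speed below are assumptions.

```python
import numpy as np

def per_bin_noise_update(noise_prev, noisy_power, speech_prob, eps_max=0.1):
    """Recursive noise update with a frequency-dependent adaptation factor.

    noise_prev: previous noise PSD estimate, shape (n_bins,).
    noisy_power: current noisy-speech periodogram, shape (n_bins,).
    speech_prob: per-bin speech presence probability in [0, 1].
    eps_max: maximum adaptation speed in bins judged to be noise only (assumed).
    """
    eps = eps_max * (1.0 - speech_prob)   # adapt slowly where speech is likely
    return (1.0 - eps) * noise_prev + eps * noisy_power
```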

SMVAD with SNRDRA

The SNRDRA noise estimator seems to be very volatile, which is reasonable, as with a low β one could expect the noise updating to be very rapid and thus heavily influenced by speech presence. This would explain its unstable speech classification when used as a VAD noise estimator, where its performance relative to the other methods shifts somewhat. However, in the highly varying SNR setting it did perform well due to its quick adaptation speed. It is quite possible that β = 0.6 is too low to track the noise satisfactorily at high SNR. By increasing it, making the sigmoid transition of the smoothing factor sharper, it is reasonable to think that the method would perform better at high SNR at the expense of performance in low and varying SNR.
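The role of β can be illustrated with a sketch of an SNR-dependent smoothing factor in the spirit of the adaptive estimator in [16]; the threshold T, the dB mapping and the exact sigmoid form are assumptions, not the implementation used in this work.

```python
import numpy as np

def snr_dependent_noise_update(noise_prev, noisy_power, beta=0.6, T=3.0):
    """SNR-dependent recursive averaging for one frame (element-wise per bin).

    The a posteriori SNR, estimated against the previous noise estimate, is
    mapped through a sigmoid: a larger beta gives a sharper transition, so the
    update freezes more decisively when speech (high SNR) is present.
    """
    post_snr_db = 10.0 * np.log10(
        np.maximum(noisy_power, 1e-12) / np.maximum(noise_prev, 1e-12))
    alpha = 1.0 / (1.0 + np.exp(-beta * (post_snr_db - T)))  # smoothing factor
    return alpha * noise_prev + (1.0 - alpha) * noisy_power
```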

A Priori SNR Estimator

There was a clear difference in performance between the two a priori SNR estimation methods for IMCRA. The ML method performed much better as a speech detector in the SMVAD. However, when IMCRA is used only as a noise estimator the two estimation methods might still differ, as they will have different effects on the estimated noise, i.e. when used as part of a noise suppression system the DD method might still yield better speech quality. This was not evaluated. Using the ML estimation also saves computational resources. While the ML estimation showed better performance than DD for IMCRA, the opposite was true for LRA: there the DD estimator made for a better speech detector across every evaluated condition, as was shown in [26].
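For reference, the two a priori SNR estimators being compared can be sketched as follows; the decision-directed form follows Ephraim and Malah [6], while the smoothing constant and the simple ML form shown here are assumptions rather than the exact expressions used in this work.

```python
import numpy as np

def a_priori_snr_dd(gamma_post, gain_prev, gamma_post_prev, a=0.98):
    """Decision-directed a priori SNR estimate (per bin).

    gamma_post, gamma_post_prev: a posteriori SNR in the current / previous frame.
    gain_prev: spectral gain applied in the previous frame.
    a: smoothing constant (assumed value).
    """
    return (a * (gain_prev ** 2) * gamma_post_prev
            + (1.0 - a) * np.maximum(gamma_post - 1.0, 0.0))

def a_priori_snr_ml(gamma_post):
    """ML-style estimate: the a posteriori SNR minus one, floored at zero."""
    return np.maximum(gamma_post - 1.0, 0.0)
```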


5.1.2 ETSI

As previously discussed, the signal scaling used for this method was found empirically, and therefore the scaling might not be optimal, which would have an adverse effect on the speech detection capabilities. Moving forward, however, the assumption is made that this method is correctly implemented and that the signal scaling is satisfactory. The speech signal structure will have an impact on a method like ETSI, which employs a static hang-over scheme. As indicated by the ROC curves, the inherent classifying ability of the energy difference measure cannot by itself be used as a VAD. As the difference measure was never intended to be used as a classifier on its own, this makes the comparison a bit unfair, but it still highlights one point, namely that in this case the static hang-over scheme worked very well in high SNR. With a more varying sentence structure across the sound files another picture might emerge. At the same time as it did a phenomenal job in high SNR it, along with the signal scaling, had an adverse effect on the performance in low SNR. However, it is reasonable to tune a method for higher SNR as this is the more likely scenario. Another reason for the exceptional performance of the static hang-over scheme could be that the energy threshold used to define the ideal VAD was set too low. This would make it possible for the static hang-over to detect the very weak speech endings, but not for the HMM-based hang-over, as it relies on the LRT, which would be very small when the speech is weak. This should however have minimal effect.
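A static hang-over of the kind discussed here amounts to a simple counter that keeps the detector in speech mode for a fixed number of frames after the last raw detection; a sketch is given below, where the frame count is an arbitrary placeholder rather than the value in the ETSI specification.

```python
def apply_static_hangover(raw_decisions, hangover_frames=8):
    """Extend raw per-frame speech decisions by a fixed hang-over.

    raw_decisions: iterable of booleans from the instantaneous classifier.
    hangover_frames: frames to stay in speech mode after the last raw detection.
    """
    out, counter = [], 0
    for speech in raw_decisions:
        if speech:
            counter = hangover_frames   # re-arm the hang-over timer
            out.append(True)
        elif counter > 0:
            counter -= 1                # still inside the hang-over period
            out.append(True)
        else:
            out.append(False)
    return out
```

Such a scheme catches weak speech endings regardless of how small the underlying detection statistic becomes, which is consistent with the behaviour observed above, but it also extends every false detection by the same fixed amount.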

ETSI's operating point performed overall better than Aurora, except for low SNR white noise, and was equal to IMCRA in terms of speech detection. The reason ETSI has trouble with white noise is that it does not use any kind of spectral analysis, since the energy measure is based on the entire frequency spectrum. Not using analysis of this sort ignores the frequency-dependent structure of speech, giving equal weight to the entire energy spectrum even though there are no speech components at the highest frequencies. This was not a problem for the other two noise types, as they have spectral content more similar to speech.

5.1.3 Aurora

As with ETSI, it will be assumed that Aurora is implemented in such a way that it is a good representation of the algorithm used by Konftel. One noticeable drawback of Aurora is that the method seems to be very reliant on signal strength, more so than ETSI. This is an issue, as it will only work well within a narrower signal strength window, making it very sensitive to high noise. First of all the noise will affect the signal strength, but secondly it will likely cause the speaker to speak louder or move closer to the device, both of which increase the signal strength further. The algorithm is not very robust, but it does work well when the signal scaling fits and the SNR is high. This was clearly shown in the 18 and 12 dB cases, where it performed similarly to the SMVAD with IMCRA in terms of speech detection. Aurora's very good performance in white noise could possibly be due to the parameter values having been developed in white or coloured noise.

5.2 Noise Estimation Evaluation

This section evaluates the four CNEs as well as ETSI and Aurora when used as noise estimators.

5.2.1 Continuous Noise Estimation

IMCRA

As IMCRA tracks the minimum of the noise PSD it is not surprising that it does well at all SNR levels and for all kinds of noise when looking at the squared error measures. As previously noted, IMCRA is very stable and did not degrade in either high or low SNR as most other methods did. However, its performance relative to the other methods did worsen in low SNR. An explanation for this would be that the indicator function (which is basically a VAD) used for removing speech components marks more and more frequency bins as containing speech, resulting in an outdated noise estimate in parts of the frequency spectrum. Even though this method has some drawbacks it is a very stable estimator. Robustness is probably the most important feature for a noise estimator in the squared error sense, as it is more important not to overestimate the noise than the other way around. Robustness should also be an attractive feature when the estimator is used as part of a speech enhancement system, as it would seemingly give comparable noise suppression regardless of speech presence and noise conditions.

RNE

RNE overestimated the noise due to speech activity, which was especially obvious at high SNR. This overestimation was not prominent at the lower SNR levels, and it is clear that RNE tracks the noise spectrum better in low SNR due to its fast adaptation speed. For the highly varying SNR this method struggled with every high-SNR part where, as previously mentioned, incorporating speech in the noise estimate affects the squared error measures more than at low SNR, which could probably explain why it did not perform better than IMCRA.

LRA

LRA performed the worst; see the VAD evaluation in Section 5.1.1.

SNRDRA

SNRDRA suffered from heavy overestimation, even though it always performed best in terms of MedSE. In contrast to the other CNE methods, SNRDRA does not include any speech presence uncertainty in the recursive update, which would explain the bad performance in the presence of speech. This would suggest that it tracks the noisy speech spectrum well but not necessarily the noise spectrum. This is indicated by its performance increase as the SNR drops, because at low SNR tracking the noisy speech spectrum is close to just tracking the noise spectrum, and therefore it seemingly performs well at low SNR. Nevertheless, even at the lowest SNR the overestimation is noticeable.

5.2.2 ETSI

When utilizing ETSI as a noise estimator the biggest problem is that it stops the noise estimation due to falsely classifying the signal as speech, which happens in most low SNR conditions. This causes the noise estimate to be very unreliable and not very useful. However, when the SNR is high it does a good job at estimating the noise. In short, it is not a robust estimator in comparison to e.g. the IMCRA method. In general ETSI did perform better than Aurora.

5.2.3 Aurora

Aurora suffers from the same problems as ETSI. As previously mentioned, Aurora utilizes 10 conditional statements to check for speech presence, and none of these were designed to work as a sole classifier. Instead they were meant to detect different features of speech and together define a good classifier. This makes it harder to evaluate Aurora as an empirical method. Also, evaluating each of the logical classifiers separately does not help much, as it cannot be assumed that the conditional statements are independent. For example, Aurora seems to perform worse than ETSI in terms of variability even though they have very similar detection rates in certain situations. This would indicate that they detect speech differently, meaning that even if they had exactly the same TPR and FPR the noise estimates would still differ. The higher variability could indicate that Aurora has a harder time detecting long speech segments or speech onsets. This could possibly be explained by Aurora relying on the hang-over scheme and recursive values to detect ongoing speech: the longer the speech segment, the harder it becomes to continue detecting it, as the recursive averages will more and more resemble those of speech.

5.3 VAD vs. CNE

The SMVAD was never tested as a VAD noise estimator. It is instead assumed that if it were able to obtain performance equal to either ETSI or Aurora, it would achieve similar estimation results. This assumption might not be valid, as it has already been indicated that ETSI and Aurora estimated the noise slightly differently even though they had almost the same TPR and FPR. Nevertheless, it is reasonable to believe that the comparison still gives an indication of the noise estimation performance the SMVAD would attain if it were used as a VAD noise estimator.


It is not surprising that the VAD and the CNE do in fact estimate the noise differently. The idea of continuous estimation is good in theory, but the question is whether it is better in practice than employing a binary speech decision. First of all, both need to be empirically tuned, but it could be argued that employing a statistical model makes it easier to tune the parameters. Secondly, the CNE (or the SMVAD) schemes are not reliant on any kind of signal strength, which would make them better in a wider range of possible scenarios. However, it should be noted that when ETSI and Aurora operate within a certain scenario range they do in fact have noise estimation characteristics better than or equal to the CNE. The biggest gain in moving from a VAD noise estimator to a CNE seems to be in estimator stability, which was apparent when some kind of minimum tracking was used together with a speech presence uncertainty measure, i.e. either IMCRA or RNE. These two methods showed similar estimation characteristics for every noise type and SNR level, and this cannot be said about either ETSI or Aurora, as these were much more affected by environmental differences.

Relying on minimum tracking ensures that there is a reasonably up-to-date noise estimate, whereas relying on a VAD decision does not. At low SNR both ETSI and Aurora suffered from being locked in speech mode for long periods, only detecting the weakest noise, and while this did not result in any overestimation, it did underestimate the noise and did not track any changes in the noise characteristics. Hence, a lot of residual noise would still be present in the signal after applying noise suppression.

Using a CNE instead of a VAD-based noise update seems to increase estimator robustness, especially in non-stationary noise and at low SNR. However, as expected, it does little when the conditions are good or when the noise is stationary. This brings us to an important observation, namely that it would appear promising to combine a VAD and a CNE scheme and use them in different situations. A simple way of controlling the switch between the methods could be to monitor the speech detection and, when it is unreasonably high, switch estimation method.
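Such a switch could be sketched roughly as follows; the monitoring window and the speech-activity threshold are hypothetical values that would need tuning, and the function name is illustrative only.

```python
def choose_noise_estimate(vad_flags, vad_estimate, cne_estimate,
                          window=200, max_speech_ratio=0.9):
    """Pick between a VAD-based and a continuous noise estimate for one frame.

    vad_flags: recent per-frame speech decisions (most recent last).
    If the VAD has been in speech mode for an implausibly large share of the
    window, its noise estimate is likely stale, so fall back to the CNE.
    """
    recent = vad_flags[-window:]
    speech_ratio = sum(recent) / max(len(recent), 1)
    return cne_estimate if speech_ratio > max_speech_ratio else vad_estimate
```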

5.4 Critique

5.4.1 Evaluation Format

One important thing to note is the problem of having a similar structure for every sentence, which means that this evaluation is in part tied to how well these methods work with this sentence structuring. A more comprehensive study would include speech with a structure more similar to a conversation. Also, the noise sources chosen represent only a small set of possible noises. It should be noted that the way the signals were concatenated makes the transitions between the sentences a bit unnatural; however, this should have minimal effect. On the other hand, it could also be argued that this just adds an additional dimension to the testing, as it could simulate an abrupt change in the noise. The transition probabilities for this implementation were estimated on the same samples they were used on, which introduces an unfairness in the evaluation process biased towards the HMM hang-over scheme. They should instead have been estimated on a much larger set of speech with varying structure and then used on another speech set to verify their validity.

While the ROC does give a clear image of differences in the overall classifier performance, it does not give insight into how the different methods detect speech. As previously discussed, the same detection rate does not necessarily imply the same noise estimation characteristics. The misdetection of voiced and unvoiced sounds cannot be considered equal, and neither can the misdetection of a weak versus a strong speech component. Some additional evaluation of what kinds of speech the methods detect differently, and how this would affect the noise estimate, would be a helpful addition.

A problem with using the squared error measures for evaluating the noise characteristics is that it is not currently known how these measures explain differences in perceived speech quality if the estimated noise is used for noise suppression. Therefore the evaluation of the CNE based on squared error measures needs to be accompanied by more information in order to evaluate its performance as a noise estimator for purposes of speech enhancement. These measures also treat noise under- and overestimation equally, even though overestimations can attain higher values, and those two kinds of error cannot be considered the same. It is however safe to say that a noise estimator with perfect squared error values would be an incredibly good noise estimator. Another thing to note is that, as previously shown, most methods had trouble with the squared error values at high SNR, but it is important to keep in mind that these high values are only relative to the noise strength. When the noise strength is low, overestimations might still not be very audible, as the speech is so prominent.
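For concreteness, the summary statistics reported in the tables can be computed from the per-frame squared errors roughly as follows; this is only a sketch, and the definition of the per-frame error itself follows the earlier chapters and is not repeated here.

```python
import numpy as np

def squared_error_summary(sq_errors):
    """Summarize per-frame squared noise-estimation errors.

    sq_errors: 1-D array of squared errors, one value per frame.
    Returns (MedSE, IQR, 90th percentile, MSE).
    """
    sq_errors = np.asarray(sq_errors)
    med_se = np.median(sq_errors)
    iqr = np.percentile(sq_errors, 75) - np.percentile(sq_errors, 25)
    perc_90 = np.percentile(sq_errors, 90)
    mse = np.mean(sq_errors)
    return med_se, iqr, perc_90, mse
```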

A drawback of only simulating the results for the evaluation process is that it is hard to verify their validity. It is possible that there are some errors in the implementation of Aurora, so that it does not represent the algorithm employed in the teleconference phone. To avoid this, the discussed methods should be evaluated on signals picked up by the device, and Aurora should be evaluated in real time as implemented by Konftel. It would seem that Aurora should perform better than ETSI, as it employs spectral analysis; however, as implemented in this work it does not, which could indicate implementation issues.

5.4.2 Model Assumptions

It is not surprising that the classifiers and noise estimators based on the statistical model performed best in Gaussian noise, as this scenario at least fulfills part of the model assumptions. As mentioned earlier, the assumption that speech and noise are Gaussian processes is a debatable one, but even though the actual distributions might be different from what was assumed, the model seems to be robust enough to handle most deviations.

5.5 Conclusion

The best SMVAD implementation, using the ML estimation for IMCRA, can be considered a better classifier than the implemented version of Aurora. Furthermore, it has been shown that with the heuristic additions it is possible to define an even better VAD in most of the tested conditions, and this is therefore proposed for use. Using the originally proposed DD a priori SNR estimator with IMCRA has an adverse effect on the ability to correctly detect speech when used together with the SMVAD.

Employing the SMVAD with, for example, IMCRA would yield two possibilities for obtaining a noise update. In good conditions, with high SNR and stationary or close to stationary noise, the VAD estimator could be used effectively, but as the conditions worsen, i.e. the noise increases or becomes more non-stationary, it could switch over to rely on the CNE estimation. By implementing the aforementioned scheme this becomes a possibility, in contrast to using Aurora. However, this would require additional testing and evaluation.

While the SMVAD has been widely used due to its simplicity and good classifying ability, it is clearly very reliant on what kind of noise updating it utilizes. Employing the originally proposed LRA estimation yields far worse results than combining it with a more robust CNE like IMCRA.

In terms of noise estimation capabilities, both ETSI and Aurora did better than or equal to IMCRA and RNE at high SNR, whereas the latter two did better at lower SNR.

5.6 Further Studies

• It would seem favourable to combine a VAD and a CNE in order to define a very robust noise estimator. Investigating how to efficiently aggregate the two, and evaluating the effect on the noise estimates in various environmental conditions, could be valuable.

• As this work only considered the task of estimating the noise, it would be important to know how these different methods actually perform when employed in an audio processing algorithm including noise suppression, echo cancellation and double-talk detection.


Bibliography

[1] J. Benesty, S. Makino, and J. Chen, Speech enhancement, Signal and Communication Technology, New York: Springer, 2005.

[2] C. M. Bishop, Pattern recognition and machine learning, New York: Springer Science+Business Media, 2006.

[3] J.H. Chang, N.S. Kim, and S.K. Mitra, "Voice activity detection based on multiple statistical models", IEEE Transactions on Signal Processing 54 (2006), no. 6, 1965–1976.

[4] I. Cohen, "Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging", IEEE Transactions on Speech and Audio Processing 11 (2003), no. 5, 466–475.

[5] G. Doblinger, Computationally efficient speech enhancement by spectral minima tracking in subbands, Tech. report, Technische Universität Wien, 1995.

[6] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-32 (1984), no. 6, 1109–1121.

[7] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-33 (1985), no. 2, 443–445.

[8] ETSI, Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms, Tech. report, European Telecommunications Standards Institute, 2007.

[9] F. A. Everest, The master handbook of acoustics, New York: TAB Books, 1994.

[10] T. Fawcett, ROC graphs: Notes and practical considerations for researchers, Tech. report, HP Laboratories, March 2004.


[11] S. Gazor and W. Zhang, "Speech Probability Distribution", IEEE Signal Processing Letters 10 (2003), no. 7, 204–207.

[12] H.W. Gierlich and F. Kettler, Background noise transmission and comfort noise insertion: The influence of signal processing on "speech"-quality in complex telecommunication scenarios, Tech. report, HEAD acoustics GmbH, 2007.

[13] Y. Huang and J. Benesty (eds.), Audio signal processing for next-generation multimedia communication systems, Springer US, 2004.

[14] ITU, Test signals for use in telephonometry, Tech. report, Telecommunication Standardization Sector of ITU, January 2012.

[15] D. M. Kaplan, "Improved quantile inference via fixed-smoothing asymptotics and Edgeworth expansion", Journal of Econometrics 185 (2015), no. 1, 20–32.

[16] L. Lin, W.H. Holmes, and E. Ambikairajah, "Adaptive noise estimation algorithm for speech enhancement", Electronics Letters 39 (2003), no. 9, 754–755.

[17] P. C. Loizou, Speech enhancement: Theory and practice, 2 ed., New York: Taylor and Francis Group, 2013.

[18] S. E. Alm and T. Britton, Stokastik, 1 ed., Stockholm: Liber, 2008.

[19] R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing 9 (2001), no. 5, 504–512.

[20] M. Marzinzik and B. Kollmeier, "Speech Pause Detection for Noise Spectrum Estimation by Tracking Power Envelope Dynamics", IEEE Transactions on Speech and Audio Processing 10 (2002), no. 2, 109–118.

[21] A. V. Oppenheim and A. S. Willsky, Signals & systems, 2 ed., Upper Saddle River: Prentice Hall, 1997.

[22] J. Ramírez, J. M. Górriz, and J. C. Segura, Voice activity detection: Fundamentals and speech recognition system robustness, Tech. report, University of Granada, June 2007.

[23] J. Ramírez, J. C. Segura, C. Benítez, L. García, and A. Rubio, "Statistical Voice Activity Detection Using a Multiple Observation Likelihood Ratio Test", IEEE Signal Processing Letters 12 (2005), no. 10, 689–692.

[24] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments", Speech Communication 48 (2006), 220–231.

[25] M. Z. Salas, Overview of single channel noise suppression algorithm, Tech. report, Purdue University, October 2010.


[26] J. Sohn, N.S. Kim, and W. Sung, "A Statistical Model-Based Voice Activity Detection", IEEE Signal Processing Letters 6 (1999), no. 1, 1–3.

[27] J. Sohn and W. Sung, "A Voice Activity Detector Employing Soft Decision Based Noise Spectrum Estimation", IEEE ICASSP 1 (1998), 365–368.

[28] P. Stoica and R. Moses, Spectral analysis of signals, Upper Saddle River: Prentice Hall, 2005.

[29] K. K. Talukdar and W. D. Lawing, "Estimation of the parameters of the Rice distribution", The Journal of the Acoustical Society of America 89 (1991), no. 3, 1193–1197.

[30] M. Weeks, Digital signal processing using MATLAB and wavelets, 2 ed., Sudbury: Jones and Bartlett Publishers, 2011.
