
CHAPTER 4

SPEECH ENHANCEMENT USING MULTI-BAND WIENER FILTER

4.1 Introduction

In real environments, the speech signal may be superimposed with environmental interference. In general, the spectrum of this noise varies with frequency. Since environmental noise is colored, the original speech spectrum is not affected uniformly by it. Researchers have recently focused on this issue and some findings are reported in [64-66]. Sub-band single-channel speech enhancement systems have been developed by dividing the whole noisy speech spectrum into frequency sub-bands [65], [67-68], with the division based on the frequency response characteristics of the human ear. From the literature, the Bark scale is well suited for sub-banding. The power spectral density of colored noise is not constant, whereas that of white noise is. Consequently, performing speech enhancement by multiplying the entire noisy speech spectrum by the same weighting factor will distort the speech signal. Hence, it is necessary to use different weighting factors for different sub-bands. In this way, speech distortion is reduced while most of the musical noise is eliminated.

This chapter deals with multi-band noise suppression techniques in which a different weighting factor is used in each sub-band. To resemble the frequency response characteristics of the human ear, the frequency spectrum of the noisy speech is divided into sub-bands based on the nonlinear Bark scale.

4.2 Multi-band Speech Enhancement

In the earlier proposed spectral subtraction methods [8], [39], the authors assumed that the noise and the speech signal are uncorrelated, so the cross-correlation terms between the clean speech signal and the noise signal were neglected. If the speech with background noise y(n) is considered as the original speech signal x(n) additively corrupted by the background noise d(n), the noisy signal is given by

y(n) = x(n) + d(n)                                                          (4.2.1)

This assumption holds while the speech signal is stationary, which is not the case in practice. By applying the FFT to (4.2.1), at the m-th frame and k-th frequency bin, y(n) can be represented as

Y(m,k) = X(m,k) + D(m,k)                                                    (4.2.2)

where Y(m,k), X(m,k) and D(m,k) are the DFT coefficients of the speech with background noise, the clean speech and the noise signal respectively. The power spectrum of the noise-corrupted speech signal is given by

|Y(m,k)|^2 = |X(m,k)|^2 + |D(m,k)|^2 + X(m,k)D*(m,k) + X*(m,k)D(m,k)        (4.2.3)

where D*(m,k) and X*(m,k) denote the complex conjugates of D(m,k) and X(m,k). The term |X(m,k)|^2 is the power spectrum of the original speech signal.

In spectral subtraction, |X(m,k)|^2 is estimated from (4.2.3). The terms |D(m,k)|^2, X(m,k)D*(m,k) and X*(m,k)D(m,k) in (4.2.3) cannot be obtained directly, so their expected values are used instead of their true values. E{|D(m,k)|^2} is obtained using the method proposed in Chapter 3. In power spectrum subtraction, d(n) is assumed to be zero mean and independent of x(n), so the expected values of the terms X(m,k)D*(m,k) and X*(m,k)D(m,k) are zero. Based on these assumptions, the original speech power spectrum can be estimated as

|X̂(m,k)|^2 = |Y(m,k)|^2 − E{|D(m,k)|^2}                                     (4.2.4)
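
For illustration, a minimal sketch of the subtraction rule (4.2.4) for a single frame is given below in Python/NumPy (the simulations reported later in this chapter were carried out in MATLAB). The function name and the 256-point FFT length are illustrative assumptions, and the noise power E{|D(m,k)|^2} is assumed to come from an external estimator such as the one described in Chapter 3.

import numpy as np

def power_spectral_subtraction(noisy_frame, noise_psd, n_fft=256):
    # Y(m,k): DFT of the windowed noisy frame, Eq. (4.2.2)
    Y = np.fft.rfft(noisy_frame, n_fft)
    # Eq. (4.2.4): subtract the expected noise power, floored at zero
    clean_psd = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)
    # resynthesise with the noisy phase, as in conventional spectral subtraction
    X_hat = np.sqrt(clean_psd) * np.exp(1j * np.angle(Y))
    return np.fft.irfft(X_hat, n_fft)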

4.2.1 Cross correlation Terms in Spectral Subtraction

The assumption of earlier researchers that the statistical characteristics of speech and noise are uncorrelated limits the performance of the spectral subtraction algorithm; it is not a valid assumption in real-world environments. The autocorrelation sequences of one frame of clean speech, of the background noise and of the noisy version of the same frame are shown in Fig.4.1. From this graphical illustration it is observed that the autocorrelation sequence of the speech signal with background noise is not the same as the sum of the autocorrelation sequences of the noise and the original speech signal. Fig.4.1 therefore shows the existence of cross-correlation between the clean speech signal and the noise signal [69-70].

Therefore, the terms X(m,k)D*(m,k) and X*(m,k)D(m,k) cannot be neglected. By including these terms, the musical noise in the processed speech signal can be reduced. Since the discussion concerns single-channel noise suppression techniques, samples of the original speech are not available. In order to approximate and include the cross-correlation terms, the noisy speech spectrum is used: Y(m,k)D*(m,k) and Y*(m,k)D(m,k) are estimated instead of X(m,k)D*(m,k) and X*(m,k)D(m,k).

Fig.4.1 Verification of the existence of cross-correlation between the original speech and noise signals.

This cross-correlation between the spectra of the speech signal with background noise and the noise signal is obtained using the cross-correlation coefficient ρ, given by

ρ_yd = σ_yd / (σ_y · σ_d),   0 ≤ ρ ≤ 1                                      (4.2.5)

where

σ_yd  = (1/M) Σ_{i=0}^{M-1} Y(i) D̂(i)
μ_y   = (1/M) Σ_{i=0}^{M-1} Y(i)
μ_d   = (1/M) Σ_{i=0}^{M-1} D̂(i)
σ_y^2 = (1/M) Σ_{i=0}^{M-1} Y^2(i)
σ_d^2 = (1/M) Σ_{i=0}^{M-1} D̂^2(i)

and M is the length of the FFT frame. The value of ρ is proportional to the degree of cross-correlation between the clean speech and the noise.
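
The coefficient ρ of (4.2.5) can be computed directly from the magnitude spectra of one noisy frame and the corresponding noise estimate; a small illustrative sketch follows (hypothetical names, with a small constant added only to guard against division by zero).

import numpy as np

def cross_corr_coefficient(Y_mag, D_mag):
    # sigma_yd, sigma_y and sigma_d as in Eq. (4.2.5); M is the frame length
    M = len(Y_mag)
    sigma_yd = np.sum(Y_mag * D_mag) / M
    sigma_y = np.sqrt(np.sum(Y_mag ** 2) / M)
    sigma_d = np.sqrt(np.sum(D_mag ** 2) / M)
    rho = sigma_yd / (sigma_y * sigma_d + 1e-12)
    return float(np.clip(rho, 0.0, 1.0))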

4.2.2 Mel-scale Spectral Subtraction

In most cases the power spectral density of environmental noise changes with frequency, unlike that of white noise. Such noise is known as colored noise, and it affects the speech spectrum differently in different frequency bands. Machine noise, for example, concentrates most of its energy at low frequencies. A multi-band spectral subtraction was proposed by S. Kamath et al. [65], in which the noisy speech spectrum is divided into sub-bands with linear frequency spacing. However, they did not consider any cross-correlation terms.

In the present approach, the frequency spectrum of the noisy signal is divided using non-linear spacing between the sub-bands for spectral subtraction, and the cross-correlation terms are also included in the subtraction process. In this work the mel scale is used to divide the human voice frequency range into 31 sub-bands. Table 4.1 shows the frequency ranges of the sub-bands.


The frequency components of the input noisy speech signal are converted to the mel scale using

m_f = 2595 log10(1 + f/700)                                                 (4.2.6)

and the sub-band noisy speech spectrum is obtained as

Y_i(m) = Σ_{k ∈ Ω(i)} |Y(m,k)|^2,    i = 1, ..., K                          (4.2.7)

where i is the sub-band number, K = 31 is the total number of sub-bands and Ω(i) is the set of frequency indices between the lower and upper boundaries of sub-band i, which depend on the lower and upper frequency boundaries of the critical band i.

Table 4.1 Sub-bands on mel-scale

Sub-band    Number of bins    Frequency (Hz)
   0              1              0-31
   1              1              31-62
   …              …              …
  12              1              375-406
  13              2              406-469
  14              2              469-531
  15              2              531-594
  16              2              594-656
  17              2              656-719
  18              2              719-781
  19              2              781-844
  20              3              844-938
  21              3              938-1031
  22              4              1031-1156
  23              6              1156-1344
  24              6              1344-1531
  25              8              1531-1781
  26              9              1781-2063
  27             10              2063-2375
  28             12              2375-2750
  29             14              2750-3188
  30             18              3188-3750
  31              9              3750-4000
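
The boundaries in Table 4.1 can be generated from (4.2.6) by placing equally spaced points on the mel axis and mapping them back to FFT bin indices; the sketch below illustrates this under the assumption of a 256-point FFT at 8 kHz (the exact rounding convention used in the thesis is not specified, so the resulting edges only approximate Table 4.1).

import numpy as np

def mel_band_edges(n_bands=31, fs=8000, n_fft=256):
    # Eq. (4.2.6) and its inverse
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # equally spaced points on the mel axis between 0 Hz and fs/2
    mel_points = np.linspace(0.0, hz2mel(fs / 2.0), n_bands + 1)
    hz_points = mel2hz(mel_points)
    # map the Hz boundaries to FFT bin indices k(i)
    return np.round(hz_points * n_fft / fs).astype(int)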

According to the mel scale, the speech spectrum is divided into K sub-bands and enhancement is carried out by subtracting the estimated noise in each sub-band. The original speech spectrum in the ith sub-band is obtained using

|X̂_i(m,k)|^2 = |Y_i(m,k)|^2 − α_i |D̂_i(m,k)|^2 − ρ |Y_i(m,k)| |D̂_i(m,k)|,   if this difference is greater than 0
|X̂_i(m,k)|^2 = β |D̂_i(m,k)|^2,                                              otherwise        (4.2.8)

for k_i ≤ k ≤ k_{i+1}, where k_i and k_{i+1} correspond to the starting and ending frequencies of the ith sub-band and α_i is the subtraction factor of the ith sub-band. The choice of α_i dictates the amount of noise eliminated; to provide the best trade-off between audible distortion and residual noise peaks, this subtraction factor must be selected appropriately. If α_i > 1, over-subtraction takes place and the noisy speech spectrum is over-attenuated. To avoid this over-attenuation, the flooring parameter β is used: it gives a minimum value to the gain of the subtraction filter, and masking of the residual noise occurs.
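
An illustrative sketch of the sub-band rule (4.2.8) is given below; alpha holds the per-band subtraction factors α_i, edges the bin boundaries k_i (for example from Table 4.1), rho the coefficient of (4.2.5), and the value β = 0.03 is only an assumed example of the flooring parameter.

import numpy as np

def multiband_subtract(Y_mag, D_mag, edges, alpha, beta=0.03, rho=0.0):
    # Eq. (4.2.8) applied band by band; edges must span the whole spectrum
    X2 = np.zeros_like(Y_mag)
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        y2 = Y_mag[lo:hi] ** 2
        d2 = D_mag[lo:hi] ** 2
        cross = rho * Y_mag[lo:hi] * D_mag[lo:hi]        # approximated cross-correlation term
        est = y2 - alpha[i] * d2 - cross
        X2[lo:hi] = np.where(est > 0.0, est, beta * d2)  # flooring parameter beta
    return np.sqrt(X2)                                   # enhanced magnitude in each sub-band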

Equation (4.2.8) gives the estimated enhanced speech spectrum, which should converge to the original speech spectrum. However, the enhanced speech obtained with this method does not converge to the original speech because of the non-stationary nature of speech signals. This spectral subtraction can also be represented as a filter with gain G(m,k), whose value lies in the range 0 to 1. The enhanced speech spectrum is then obtained by passing the noisy speech through this filter:

X̂(m,k) = G(m,k) · Y(m,k),    0 ≤ G(m,k) ≤ 1                                 (4.2.9)

Comparing (4.2.4) and (4.2.9), we have G_i(m,k) as

G_i(m,k) = 1 − |D̂_i(m,k)|^2 / |Y_i(m,k)|^2 = 1 − 1/SNR_post                 (4.2.10)

where

SNR_post = Σ_{k=w_i}^{w_{i+1}} |Y(m,k)|^2 / Σ_{k=w_i}^{w_{i+1}} |D̂(m,k)|^2

is the a posteriori SNR of the ith sub-band, with w_i and w_{i+1} the boundary frequency bins of that sub-band. According to (4.2.10), the filter gain depends on the a posteriori SNR. In regions where the estimated noise is higher than the noisy speech power spectrum, G(m,k) is set to zero. There is thus a trade-off between the a posteriori SNR and the attenuation of the noisy speech: as the a posteriori SNR increases, the attenuation applied to the noisy speech decreases. Implementation of this approach is simple because the a posteriori SNR can be obtained easily.
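
Viewed as the gain function of (4.2.9)-(4.2.10), the same operation reduces to a few lines; the sketch below assumes the sub-band powers of (4.2.7) have already been computed and simply clamps the gain to the range [0, 1] required by (4.2.9).

import numpy as np

def subtraction_gain(Y_band_power, D_band_power):
    # Eq. (4.2.10): G = 1 - 1/SNR_post, clamped to [0, 1]
    snr_post = (Y_band_power + 1e-12) / (D_band_power + 1e-12)
    return np.clip(1.0 - 1.0 / snr_post, 0.0, 1.0)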


4.2.3 Multi-band Wiener Filter

Based on the frequency response characteristics of the human ear, a multi-band approach using non-linear sub-bands is proposed here. A spectral gain for enhancement based on the psychoacoustics of the human ear is proposed in [68]. Here the Bark scale is used to divide the human voice frequency range into 24 sub-bands, which is suitable for representing the band-pass filtering nature of the human ear. The relationship between the input frequency f (in kHz) and the critical-band rate is

b(f) = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)^2)                            (4.2.11)

and is shown graphically in Fig.4.2.

Fig.4.2 (a) Critical band rate and (b) frequency

Out of the 24 sub-bands, 18 are sufficient to represent a speech signal sampled at 8 kHz. The noisy speech power spectrum on the Bark scale is given by

Y_i(b) = Σ_{k ∈ Ω(i)} |Y(k)|^2,    i = 1, 2, ..., K                          (4.2.12)

where i is the sub-band index, K = 18 is the total number of sub-bands and Ω(i) represents the frequency range of sub-band i.
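
The grouping of FFT bins into Bark sub-bands according to (4.2.11)-(4.2.12) can be sketched as follows; assigning bins by truncating the critical-band rate is an assumption, since the exact bin-to-band rule is not stated above.

import numpy as np

def bark_band_power(Y_mag, fs=8000, n_fft=256, n_bands=18):
    # critical-band rate of every FFT bin, Eq. (4.2.11), with f in kHz
    k = np.arange(len(Y_mag))
    f_khz = k * fs / n_fft / 1000.0
    bark = 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)
    band = np.minimum(bark.astype(int), n_bands - 1)     # Omega(i): bins belonging to band i
    # Eq. (4.2.12): noisy power collected in each Bark sub-band
    return np.array([np.sum(Y_mag[band == i] ** 2) for i in range(n_bands)])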

Conventional Wiener filtering is now used in each sub-band for noise elimination; the noisy speech is multiplied by the filter gain in each sub-band as

X̂_i(k) = G_i Y_i(k)                                                         (4.2.13)

The Wiener filter is designed by applying the MMSE criterion between the clean speech signal and the processed sub-band signals. To derive the Wiener filter gain, a cost function is formulated in each sub-band:

ε_i = E{ |X̂_i(k) − X_i(k)|^2 }                                              (4.2.14)

where X̂_i(k) and X_i(k) denote the estimated and ideal sub-band speech signals in the ith sub-band respectively. Substituting (4.2.13) into (4.2.14) and simplifying gives

ε_i = (1 − G_i)^2 E{|X_i(k)|^2} + G_i^2 E{|D_i(k)|^2} − 2 G_i (1 − G_i) E{X_i(k) D_i(k)}     (4.2.15)

where D_i(k), the estimated noise spectrum, is zero mean and assumed to be uncorrelated with X_i(k) in each sub-band. Under this assumption (4.2.15) simplifies to

ε_i = (1 − G_i)^2 E{|X_i(k)|^2} + G_i^2 E{|D_i(k)|^2}                        (4.2.16)

Differentiating (4.2.16) with respect to the weighting factor G_i and setting the derivative to zero, the weighting factor is found to be

G_i = E{|X_i(k)|^2} / ( E{|X_i(k)|^2} + E{|D_i(k)|^2} ) = S_i^2 / (S_i^2 + D_i^2) = S_i^2 / Y_i^2     (4.2.17)

where S_i^2, D_i^2 and Y_i^2 indicate the estimated power of the original speech, the background noise and the speech with background noise in the ith sub-band respectively.
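
A sketch of the per-band gain (4.2.17) is shown below; estimating the speech power S_i^2 as max(Y_i^2 − D_i^2, 0) is one common choice and is assumed here rather than taken from the derivation above.

import numpy as np

def wiener_gain(Y_band_power, D_band_power):
    # Eq. (4.2.17): G_i = S_i^2 / (S_i^2 + D_i^2), with S_i^2 estimated from Y_i^2 - D_i^2
    S_power = np.maximum(Y_band_power - D_band_power, 0.0)
    return S_power / (S_power + D_band_power + 1e-12)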


By considering the cross-correlation between X_i(k) and D_i(k) and differentiating (4.2.15) with respect to G_i, we obtain

G_i = [ E{|X_i(k)|^2} + E{X_i(k) D_i(k)} ] / [ E{|X_i(k)|^2} + E{|D_i(k)|^2} + 2 E{X_i(k) D_i(k)} ]     (4.2.18)

However, the cross-correlation term cannot be estimated directly: since we are dealing with a single channel, no reference signal for the original speech is available, and only the noisy speech Y_i(k) is at hand. Hence, the cross-correlation between X_i(k) and D_i(k) is estimated using the available Y_i(k) and the estimated D̂_i(k):

E{Y_i(k) D_i(k)} = E{X_i(k) D_i(k)} + E{D̂_i(k) D_i(k)}
E{X_i(k) D_i(k)} = E{Y_i(k) D_i(k)} − E{|D_i(k)|^2}                                                      (4.2.19)

The cross-correlation between Y_i(k) and D_i(k) is modelled as

E{Y_i(k) D_i(k)} = ρ |Y_i(k)| |D̂_i(k)|                                                                   (4.2.20)

where ρ is the cross-correlation coefficient relating the noise-corrupted speech and the noise in a sub-band, as given in (4.2.5). Substituting (4.2.20) and (4.2.19) into (4.2.18), the filter gain G_i is obtained as

G_i = [ E{|X_i(k)|^2} + ρ |Y_i(k)| |D̂_i(k)| − E{|D̂_i(k)|^2} ] / E{|Y_i(k)|^2}
    = [ ξ_i(k) + ρ |Y_i(k)| |D̂_i(k)| / E{|D̂_i(k)|^2} − 1 ] / γ_i(k)                                      (4.2.21)


where ξ_i(k) is the a priori SNR in the ith sub-band, given by ξ_i(k) = E{|X_i(k)|^2} / E{|D_i(k)|^2}, and γ_i(k) is the a posteriori SNR, defined as γ_i(k) = E{|Y_i(k)|^2} / E{|D_i(k)|^2}. The a priori SNR cannot be calculated directly; it is therefore estimated using the approach given in [34], [71]:

ξ̂_i(k) = α |X̂_i(k)|^2 / E{|D̂_i(k)|^2} + (1 − α) P[ γ_i(k) − 1 ]             (4.2.22)

where X̂_i(k) is the estimate of the original speech signal in the ith sub-band, E{|D̂_i(k)|^2} is the estimate of the noise power in the ith sub-band, P[·] performs half-wave rectification and α denotes the smoothing factor. The initial a priori SNR is obtained using the ML estimate, ξ_i(k) = E[γ_i(k) − 1].
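
Combining (4.2.21) with the decision-directed estimate (4.2.22), the per-band computation might be sketched as follows; the names, the default α = 0.98 and the final clipping of the gain to [0, 1] are assumptions rather than part of the derivation above.

import numpy as np

def modified_wiener_gain(xi, gamma, Y_mag, D_mag, rho):
    # Eq. (4.2.21): gain including the approximated cross-correlation term;
    # the clipping to [0, 1] is only a practical safeguard
    cross = rho * Y_mag * D_mag / (D_mag ** 2 + 1e-12)
    return np.clip((xi + cross - 1.0) / (gamma + 1e-12), 0.0, 1.0)

def decision_directed_xi(xi_prev_estimate, gamma, alpha=0.98):
    # Eq. (4.2.22): xi_prev_estimate stands for |X_hat_i(k)|^2 / E{|D_hat_i(k)|^2}
    # from the previous enhanced frame; P[.] is half-wave rectification
    return alpha * xi_prev_estimate + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)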

4.3 Modified A priori SNR

This section deals with the modified a priori SNR. The estimation and updating of ξ̂_i(k) is controlled by the smoothing factor α. The approach given in [67] analysed the estimation of the a priori SNR and the resulting noise reduction as a function of the smoothing factor. When α is close to 1, less residual noise remains in the enhanced speech signal, but the transient distortion is larger. Balancing these effects, researchers [8], [72] set the value of α in the range 0.95-0.99. If a constant value of α is chosen, then ξ̂_i(k) fails to follow sudden changes in the speech signal magnitude and follows the a posteriori SNR only after a certain delay. Similarly, only after a certain delay does ξ̂_i(k) follow the dips in the a posteriori SNR that correspond to decreases in the speech signal magnitude. It is therefore convenient to use smaller values of α in these transitional regions. To obtain better performance of the a priori SNR ξ̂_i(k), an adaptation method for the smoothing parameter α is proposed in [72], in which the authors assume that the statistical characteristics of the background noise are stationary and do not change from frame to frame. According to [72], α is defined in terms of the frame energy E_m = Σ_k |Y(m,k)|^2 of consecutive frames and is given by

α(m) = 1 / ( 1 + |E_m − E_{m−1}| / max(E_m, E_{m−1}) )                      (4.3.1)

In this thesis, a modified smoothing factor is proposed, defined using the MMSE criterion. This factor is suitable for adaptively changing the a priori SNR in response to changes in the speech signal magnitude. The modified a priori SNR is obtained as

ξ̂_i(m,k) = α_i(m,k) ξ̃_i(m−1,k) + [1 − α_i(m,k)] P[γ_i(m,k) − 1]             (4.3.2)

where ξ̃_i(m−1,k) = |X̂_i(m−1,k)|^2 / E{|D_i(m−1,k)|^2}, which is similar to the a priori SNR ξ_i(m,k). A cost function J is formulated based on the MMSE criterion to obtain the modified smoothing factor:

J = E{ [ ξ̂_i(m,k) − ξ_i(m,k) ]^2 | ξ̃_i(m−1,k) }                             (4.3.3)

Substituting (4.3.2) into (4.3.3), an expression for J is obtained as

J = E{ [ α_i(m,k) ξ̃_i(m−1,k) + (1 − α_i(m,k)) P[γ_i(m,k) − 1] − ξ_i(m,k) ]^2 }      (4.3.4)

Differentiating J with respect to α_i(m,k) and setting the derivative to zero, the smoothing factor is modified to

α_i^opt(m,k) = 1 / ( 1 + [ (ξ_i(m,k) − ξ̃_i(m−1,k)) / (ξ_i(m,k) + 1) ]^2 )    (4.3.5)

In the derivation of α_i(m,k) the following terms are substituted: E{[γ_i(m,k) − 1]^2} = 2ξ_i^2(m,k) + 2ξ_i(m,k) + 1, and the fourth-order moment E{|X_i(m,k)|^4} / E^2{|D_i(m,k)|^2} = 2ξ_i^2(m,k), which is discussed in [69]. Initially ξ_i(m,k) is not available, so P[γ_i(m,k) − 1] is substituted for ξ_i(m,k) to obtain an approximate value of α_i(m,k). This modified smoothing factor takes a smaller value when there are sudden changes in the speech signal magnitude, or equivalently in the a posteriori SNR; when the speech magnitude varies smoothly, the value of the smoothing parameter is close to 1. Fig.4.3 compares the variations of the a posteriori SNR with the obtained smoothing parameter and with the parameter given in [72]. The major difference between the obtained smoothing parameter and the parameter of [72] is that the modified factor changes for every frame and frequency bin, which is not the case for the parameter of [72].
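
A sketch of the resulting update, combining (4.3.2) with the adaptive factor (4.3.5) and the initial substitution ξ ≈ P[γ − 1], is given below (illustrative names only).

import numpy as np

def adaptive_alpha(xi_current, xi_tilde_prev):
    # Eq. (4.3.5): alpha close to 1 when the a priori SNR is stable, small at transitions
    ratio = (xi_current - xi_tilde_prev) / (xi_current + 1.0)
    return 1.0 / (1.0 + ratio ** 2)

def modified_a_priori_snr(xi_tilde_prev, gamma):
    # Eq. (4.3.2), using xi ~= P[gamma - 1] as the initial stand-in for xi_i(m,k)
    xi_approx = np.maximum(gamma - 1.0, 0.0)
    alpha = adaptive_alpha(xi_approx, xi_tilde_prev)
    return alpha * xi_tilde_prev + (1.0 - alpha) * xi_approx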


4.4 Simulation Results

The evaluation of the multi-band single-channel speech enhancement methods is discussed in this chapter. For this evaluation, simulations are carried out in MATLAB with the NOIZEUS database [61], which provides noisy speech samples for different environments at different SNRs. For the performance evaluation, airport, car, babble, train, restaurant, railway station and street noise environments are considered at 0 dB, 5 dB, 10 dB and 15 dB SNR levels.

Fig.4.3 Variation of the smoothing parameter α: (a) noisy speech; (b) proposed α(m,k) (dotted line) and a posteriori SNR γ(k) (solid line); (c) α_m of [65] (dotted line) and a posteriori SNR γ(k) (solid line); (d) a priori SNR ξ(k) estimated using the proposed α(m,k) for k = 37 (solid line) and the a priori SNR estimated using α_m (dotted line).

The following performance measures are used in this evaluation: segmental SNR and noise reduction (NR). Higher values of these measures indicate greater reduction of the background noise [54]. Details of these performance measures are discussed in Section 2.6. The performance of the mel-scale spectral subtraction and the multi-band Wiener filter is compared with that of power spectral subtraction and the Wiener filter.
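
The exact definitions are those of Section 2.6; for orientation only, a commonly used form of the segmental SNR measure is sketched below, with the frame length and the limiting range being assumed values.

import numpy as np

def segmental_snr(clean, enhanced, frame_len=160, snr_min=-10.0, snr_max=35.0):
    # frame-wise SNR in dB, limited to [snr_min, snr_max] and averaged over all frames
    n_frames = len(clean) // frame_len
    snrs = []
    for j in range(n_frames):
        s = clean[j * frame_len:(j + 1) * frame_len]
        e = s - enhanced[j * frame_len:(j + 1) * frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, snr_min, snr_max))
    return float(np.mean(snrs))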

Table 4.2 shows the output average segmental SNR values of the enhanced speech signals obtained with the proposed method and with the other methods. The multi-band Wiener filter outperforms power spectral subtraction, the Wiener filter and mel-scale spectral subtraction in almost all conditions. Fig.4.4 gives a graphical comparison of the output average segmental SNR values for the different noise environments.

Table 4.3 compares the noise reduction values. From this table it is clear that the reduction of background noise achieved by the proposed method is better than that of the other methods; a similar comparison is given in Fig.4.5. The time waveforms of the enhanced speech are shown in Fig.4.6, where the original speech signal is corrupted with airport noise at 0 dB SNR, and a similar comparison in terms of spectrograms is given in Fig.4.7. From these results and from listening tests it is clear that the multi-band Wiener filter removes the background noise more efficiently than the other methods.


Table 4.2 Output average segmental SNR values (dB)

Type of noise     Power Spectral      Wiener        Mel-scale Spectral           Multi-band Wiener
and SNR (dB)      Subtraction (PSS)   Filter (WF)   Subtraction (Mel-scale SS)   Filter (MWF)
Airport-0             -5.55              -4.06             -4.37                     -2.39
Airport-5             -2.61              -2.23             -2.57                      0.67
Airport-10            -0.39              -0.68             -0.06                      0.43
Airport-15             0.18               0.77              1.88                      3.13
Babble-0              -5.87              -4.40             -4.59                     -1.91
Babble-5              -1.96              -1.80             -1.39                      0.05
Babble-10             -0.55              -0.16              0.03                      2.36
Babble-15              0.27               0.76              2.71                      3.06
Car-0                 -4.55              -4.48             -3.93                     -1.02
Car-5                 -1.48              -1.90             -1.65                      1.69
Car-10                 0.23              -0.08              0.68                      2.40
Car-15                 0.48               0.75              2.31                      2.71
Street-0              -4.21              -3.16             -2.88                     -1.97
Street-5              -1.82              -2.28             -2.13                     -0.29
Street-10             -0.74              -0.31              1.20                      2.42
Street-15              0.45               0.58              2.25                      2.48
Train-0               -3.69              -3.45             -3.45                     -2.13
Train-5               -0.98              -1.75             -0.86                      0.93
Train-10              -1.95              -0.81             -0.39                      1.69
Train-15               0.46               0.49              2.62                      2.57
Restaurant-0         -17.73              -5.11             -5.49                     -3.44
Restaurant-5         -12.96              -2.72             -3.61                     -0.15
Restaurant-10         -8.40              -0.66             -0.49                      1.28
Restaurant-15         -3.43               0.49              1.80                      2.47
Station-0             -4.21              -3.72             -3.62                      0.51
Station-5             -1.82              -1.83             -1.93                      1.18
Station-10            -0.74              -0.36              0.95                      2.39
Station-15             0.45               0.76              2.72                      2.86


Fig.4.4 Comparison of output average segmental SNR values for different noises: (a) airport, (b) babble, (c) car, (d) restaurant, (e) station and (f) street.


Table 4.3 Noise reduction values (dB)

Type of noise     Power Spectral      Wiener        Mel-scale Spectral           Multi-band Wiener
and SNR (dB)      Subtraction (PSS)   Filter (WF)   Subtraction (Mel-scale SS)   Filter (MWF)
Airport-0             18.74              13.15             25.37                     25.00
Airport-5             19.98              16.95             24.68                     25.98
Airport-10            20.29              21.79             23.21                     24.01
Airport-15            22.78              22.91             23.04                     26.22
Babble-0              18.12              13.21             24.73                     25.9
Babble-5              20.19              16.50             24.70                     25.85
Babble-10             21.54              19.74             23.91                     26.60
Babble-15             21.55              23.29             22.99                     26.12
Car-0                 18.45              13.38             25.63                     26.30
Car-5                 21.08              16.17             25.11                     27.69
Car-10                22.68              19.48             24.20                     26.89
Car-15                23.70              22.96             23.07                     25.81
Street-0              19.08              13.97             25.06                     25.34
Street-5              19.38              16.53             25.43                     25.36
Street-10             22.64              19.64             24.14                     25.86
Street-15             23.21              23.54             23.54                     25.24
Train-0               17.45              13.59             25.87                     25.63
Train-5               20.06              16.61             24.89                     26.64
Train-10              21.12              19.87             24.20                     25.96
Train-15              23.72              23.68             22.86                     25.44
Restaurant-0          19.16              13.92             25.41                     23.50
Restaurant-5          20.19              16.69             25.13                     25.42
Restaurant-10         21.43              19.95             24.15                     25.61
Restaurant-15         22.53              23.32             23.10                     25.42
Station-0             19.06              13.10             25.44                     28.37
Station-5             20.91              16.12             24.81                     27.11
Station-10            22.47              19.48             23.63                     26.76
Station-15            23.86              22.91             22.92                     25.87


Fig.4.5 Comparison of noise reduction values with various enhancement techniques and the multi-band Wiener filter for different noises: (a) airport, (b) babble, (c) car, (d) street, (e) train and (f) restaurant.


Fig.4.6 Timing waveforms of (a) the original speech, (b) speech with background noise, and the enhanced speech using (c) power spectral subtraction, (d) Wiener filter, (e) mel-scale spectral subtraction and (f) multi-band Wiener filter.


Fig.4.7 Comparison of spectrograms of (a) the original speech, (b) speech with background noise, and the enhanced speech using (c) power spectral subtraction, (d) Wiener filter, (e) mel-scale spectral subtraction and (f) multi-band Wiener filter.

4.5 Conclusion

This chapter described the development of a new multi-band speech enhancement system that takes into account the effect of colored noise on the speech spectrum. In particular, mel-scale spectral subtraction and multi-band Wiener filter methods were examined, and their performance was compared with that of power spectral subtraction and Wiener filtering. From the simulation results it is observed that the multi-band Wiener filter gives better performance than the other methods.

Chapter 5 describes the reduction of the harmonic distortion caused during multi-band Wiener filter speech enhancement.