CCECE/CCGEI, May 5-7, 2008, Niagara Falls, Canada. 978-1-4244-1643-1/08/$25.00 © 2008 IEEE

[IEEE 2008 Canadian Conference on Electrical and Computer Engineering - CCECE - Niagara Falls, ON, Canada (2008.05.4-2008.05.7)] 2008 Canadian Conference on Electrical and Computer


SPEECH ENHANCEMENT EMPLOYING A SIGMOID-TYPE GAIN FUNCTION WITH A MODIFIED A PRIORI SIGNAL-TO-NOISE RATIO (SNR) ESTIMATOR

Md. Jahangir Alam¹, Douglas O’Shaughnessy¹ and Sid-Ahmed Selouani²

¹ INRS-Energie-Matériaux-Télécommunications, Université du Québec, Montréal, Canada
² Université de Moncton, Campus de Shippagan, NB, Canada

ABSTRACT

This paper presents a sigmoid-type gain function with a modified a priori signal-to-noise ratio (SNR) estimator for single-channel speech enhancement in noisy environments. Frequency-domain noise reduction techniques are often defined in terms of the a priori SNR. A widely used method for determining the a priori SNR from noisy speech is the decision-directed (DD) approach. In the DD approach the a priori SNR depends on the speech spectrum estimate of the previous frame, which degrades the noise reduction performance. To overcome this problem, a sigmoid-type weighting function is proposed together with a modified a priori SNR estimator. The performance of the proposed algorithm is evaluated with two objective measures under various noisy environments, and the proposed sigmoidal-shaped gain function is found to produce significant improvements in noise reduction performance compared to the conventional Wiener gain.

Index Terms: Signal-to-noise ratio, Wiener filter, sigmoid function, speech enhancement

1. INTRODUCTION

The problem of enhancing speech degraded by additive noise has been widely studied and remains an active field of research. Relevant noise reduction techniques can be expressed as a spectral noise suppression gain based on the a priori SNR [1]-[4]. Applying the spectral gain produces an artifact known as musical noise during noise frames. A method that removes much of the musical noise is the DD estimation approach originally proposed in [4]. The performance of the DD estimator is analyzed in [5], where it is shown that the musical noise is strongly reduced because the a priori SNR corresponds to a highly smoothed version of the a posteriori SNR in noise frames, whereas in speech frames the a priori SNR follows the a posteriori SNR with a delay of one frame. Therefore, in a DD scheme with a fixed weighting factor, the noise suppression gain computed from the delayed a priori SNR matches the previous frame rather than the current frame, which degrades the quality of the enhanced speech signal, especially in abrupt transient parts. To address this problem, a sigmoid-type gain function with a modified a priori SNR estimator is proposed. The modified a priori SNR estimator solves the delay problem, while the sigmoid-type spectral gain function applies heavy attenuation at low SNR and little or no attenuation at high SNR.

2. CLASSICAL SPEECH ENHANCEMENT SYSTEM

The distorted signal model in the time domain is expressed as

$$y(n) = x(n) + d(n), \tag{1}$$

where $x(n)$ is the clean signal and $d(n)$ is the additive random noise signal, uncorrelated with the clean signal. If $Y(m,k)$, $X(m,k)$ and $D(m,k)$ denote the spectral components of $y(n)$, $x(n)$ and $d(n)$, respectively, at the $m$th frame and $k$th frequency bin, then the distorted signal in the transform domain is

$$Y(m,k) = X(m,k) + D(m,k). \tag{2}$$

An estimate $\hat{X}(m,k)$ of $X(m,k)$ is given by

$$\hat{X}(m,k) = H(m,k)\,Y(m,k), \tag{3}$$

where $0 < H(m,k) < 1$ is the noise suppression gain, which in general is a function of the a priori SNR and the a posteriori SNR. For the Wiener gain it is given by

$$H(m,k) = \frac{\xi(m,k)}{1 + \xi(m,k)}, \tag{4}$$

where $\xi(m,k)$ is the a priori SNR.
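Equations (3) and (4) can be illustrated with a minimal Python sketch. The function names and the three-bin example frame are hypothetical and chosen only for illustration; the a priori SNR values are assumed to be given on a linear (not dB) scale.

```python
def wiener_gain(xi):
    """Wiener-type suppression gain of Eq. (4): H = xi / (1 + xi)."""
    if xi < 0:
        raise ValueError("the a priori SNR must be non-negative")
    return xi / (1.0 + xi)


def apply_gain(noisy_spectrum, gains):
    """Eq. (3): scale each noisy bin Y(m, k) by its gain H(m, k)."""
    return [h * y for h, y in zip(gains, noisy_spectrum)]


# Hypothetical frame: three complex STFT bins and their a priori SNRs.
Y = [1.0 + 1.0j, 0.5 - 0.2j, -0.1 + 0.05j]
xi = [9.0, 1.0, 0.0]

H = [wiener_gain(v) for v in xi]
X_hat = apply_gain(Y, H)
print(H)
print(X_hat)
```

Because the gain is real-valued, it changes only the magnitude of each bin and leaves the noisy phase untouched.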

The first parameter of the noise suppression rule is the a posteriori SNR, given by

$$\gamma(m,k) = \frac{|Y(m,k)|^2}{\Gamma_d(m,k)}, \tag{5}$$

where $\Gamma_d(m,k) = E\{|D(m,k)|^2\}$ is the noise power spectrum, estimated during speech pauses using the classical recursive relation

$$\Gamma_d(m,k) = \lambda_D\,\Gamma_d(m-1,k) + (1-\lambda_D)\,|Y(m,k)|^2, \tag{6}$$

where $0 \le \lambda_D \le 1$ is the smoothing factor and $E\{\cdot\}$ is the expectation operator. We have chosen $\lambda_D = 0.9$ for all cases. The a priori SNR, which is the second parameter of the noise suppression rule, is expressed as

$$\xi(m,k) = \frac{\Gamma_x(m,k)}{\Gamma_d(m,k)}, \tag{7}$$

where $\Gamma_x(m,k) = E\{|X(m,k)|^2\}$ is the clean speech power spectrum. An estimate of the a priori SNR is made according to the conventional DD approach [4] as

$$\hat{\xi}_{DD}(m,k) = \max\!\left(\alpha\,\frac{|\hat{H}_{DD}(m-1,k)\,Y(m-1,k)|^2}{\Gamma_d(m,k)} + (1-\alpha)\,P'[\vartheta(m,k)],\ \xi_{\min}\right), \tag{8}$$

where $P'[x] = x$ if $x \ge 0$ and $P'[x] = 0$ otherwise, and $\vartheta(m,k)$ is the instantaneous SNR [6], defined as

$$\vartheta(m,k) = \frac{|Y(m,k)|^2}{\Gamma_d(m,k)} - 1 = \gamma(m,k) - 1. \tag{9}$$
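The noise tracking in (6) and the two SNR quantities in (5) and (9) can be sketched per frequency bin as follows. This is only an illustrative sketch: the paper does not say how speech pauses are detected, so the `is_speech_pause` flag is an assumed external input, and all function names are hypothetical.

```python
LAMBDA_D = 0.9  # smoothing factor lambda_D chosen in the text


def update_noise_psd(prev_psd, noisy_bin, is_speech_pause, lam=LAMBDA_D):
    """Eq. (6): recursive noise power update, applied only in speech pauses.

    The pause decision is an assumption of this sketch; outside pauses the
    previous estimate is simply held.
    """
    if not is_speech_pause:
        return prev_psd
    return lam * prev_psd + (1.0 - lam) * abs(noisy_bin) ** 2


def a_posteriori_snr(noisy_bin, noise_psd):
    """Eq. (5): gamma = |Y|^2 / Gamma_d."""
    return abs(noisy_bin) ** 2 / noise_psd


def instantaneous_snr(noisy_bin, noise_psd):
    """Eq. (9): theta = gamma - 1 (may be negative before rectification)."""
    return a_posteriori_snr(noisy_bin, noise_psd) - 1.0


# Hypothetical values: previous noise power 1.0, a pause frame with |Y|^2 = 4.
noise_psd = update_noise_psd(1.0, 2.0 + 0.0j, is_speech_pause=True)
gamma = a_posteriori_snr(3.0 + 4.0j, noise_psd)
theta = instantaneous_snr(3.0 + 4.0j, noise_psd)
print(noise_psd, gamma, theta)
```

With $\lambda_D = 0.9$ each pause frame contributes only 10% of its power to the estimate, so the noise spectrum adapts slowly and smoothly.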

In this paper we have chosen $\alpha = 0.98$ and $\xi_{\min} = 0.0032$ (i.e., -25 dB) on the basis of simulations. The spectral gain for the DD approach is

$$\hat{H}_{DD}(m,k) = \frac{\hat{\xi}_{DD}(m,k)}{1 + \hat{\xi}_{DD}(m,k)}. \tag{10}$$

The time-domain denoised speech is obtained with the following relation:

$$\hat{x}(n) = \mathrm{IFFT}\!\left(|\hat{X}(m,k)|\,e^{\,j\arg(Y(m,k))}\right). \tag{11}$$
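The one-frame delay of the DD estimator can be made concrete with a short sketch that runs (8) and (10) over successive frames of a single frequency bin. The constant noise power, the zero initial state and the synthetic onset sequence are assumptions made for this illustration only.

```python
import math

ALPHA = 0.98     # weighting factor chosen in the text
XI_MIN = 0.0032  # a priori SNR floor (about -25 dB)


def dd_track(noisy_bins, noise_psd, alpha=ALPHA, xi_min=XI_MIN):
    """Run Eqs. (8) and (10) frame by frame for one frequency bin.

    Returns the a priori SNR estimates and the gains. The state before the
    first frame is assumed to be zero.
    """
    xi_track, gain_track = [], []
    prev_clean_power = 0.0  # |H(m-1, k) Y(m-1, k)|^2
    for y in noisy_bins:
        inst = abs(y) ** 2 / noise_psd - 1.0                 # Eq. (9)
        xi = max(alpha * prev_clean_power / noise_psd
                 + (1.0 - alpha) * max(inst, 0.0), xi_min)   # Eq. (8)
        h = xi / (1.0 + xi)                                  # Eq. (10)
        prev_clean_power = abs(h * y) ** 2
        xi_track.append(xi)
        gain_track.append(h)
    return xi_track, gain_track


# Synthetic onset: two noise-level frames, then a jump to gamma = 101.
frames = [1.0, 1.0, math.sqrt(101.0), math.sqrt(101.0), math.sqrt(101.0)]
xi_dd, h_dd = dd_track(frames, noise_psd=1.0)
for m, (xi, h) in enumerate(zip(xi_dd, h_dd)):
    print(m, round(xi, 4), round(h, 4))
```

At the onset frame the estimate is dominated by the $(1-\alpha)$ term, so it stays near 2 although the instantaneous SNR is 100; only in the following frame does the recursive term pull the estimate up. This is the delay discussed in the introduction.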

3. THE MODIFIED APPROACH

To overcome the drawbacks of the conventional DD approach while maintaining its benefits, several modified a priori SNR estimation approaches have been proposed [6]-[11]. In this paper, for the proposed sigmoid-type gain function, we use the a priori SNR estimated as in [10] with a slight modification. In this approach the MMSE estimate of $X^2(m,k)$ can be obtained from $Y(m,k)$ as follows:

$$\hat{X}^2(m,k) = E\{X^2(m,k) \mid Y(m,k)\} = \frac{\int_{-\infty}^{\infty} X^2\,P\{Y \mid X\}\,P\{X\}\,dX}{\int_{-\infty}^{\infty} P\{Y \mid X\}\,P\{X\}\,dX}, \tag{12}$$

where $P\{\cdot\}$ denotes the probability density function (pdf). For simplicity of notation the frame index $m$ and the frequency index $k$ are dropped.

Assuming Gaussian distributions, $P\{Y \mid X\}$ and $P\{X\}$ are expressed as

$$P\{Y \mid X\} = \frac{1}{\sqrt{2\pi\Gamma_d}}\,\exp\!\left(-\frac{(Y-X)^2}{2\Gamma_d}\right), \tag{13}$$

$$P\{X\} = \frac{1}{\sqrt{2\pi\Gamma_x}}\,\exp\!\left(-\frac{X^2}{2\Gamma_x}\right), \tag{14}$$

where $\Gamma_x = E\{X^2\}$.

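Now from (12),

$$\hat{X}^2 = \frac{\int_{-\infty}^{\infty} X^2 \exp\!\left(-\frac{(Y-X)^2}{2\Gamma_d} - \frac{X^2}{2\Gamma_x}\right) dX}{\int_{-\infty}^{\infty} \exp\!\left(-\frac{(Y-X)^2}{2\Gamma_d} - \frac{X^2}{2\Gamma_x}\right) dX} = \frac{\int_{-\infty}^{\infty} X^2 \exp\!\left(-\left(\sqrt{\frac{\Gamma_x+\Gamma_d}{2\Gamma_x\Gamma_d}}\,X - \sqrt{\frac{\Gamma_x}{2\Gamma_d(\Gamma_x+\Gamma_d)}}\,Y\right)^{2}\right) dX}{\int_{-\infty}^{\infty} \exp\!\left(-\left(\sqrt{\frac{\Gamma_x+\Gamma_d}{2\Gamma_x\Gamma_d}}\,X - \sqrt{\frac{\Gamma_x}{2\Gamma_d(\Gamma_x+\Gamma_d)}}\,Y\right)^{2}\right) dX}, \tag{15}$$

where the second equality follows by completing the square in $X$; the remaining factor $\exp\!\left(-Y^2 / (2(\Gamma_x+\Gamma_d))\right)$ is common to the numerator and the denominator and cancels.

Taking

$$Z = \sqrt{\frac{\Gamma_x+\Gamma_d}{2\Gamma_x\Gamma_d}}\,X - \sqrt{\frac{\Gamma_x}{2\Gamma_d(\Gamma_x+\Gamma_d)}}\,Y,$$

we have from (15)

$$\begin{aligned}
\hat{X}^2 &= \frac{\int_{-\infty}^{\infty} \left(\sqrt{\frac{2\Gamma_x\Gamma_d}{\Gamma_x+\Gamma_d}}\,Z + \frac{\Gamma_x}{\Gamma_x+\Gamma_d}\,Y\right)^{2} e^{-Z^2}\,dZ}{\int_{-\infty}^{\infty} e^{-Z^2}\,dZ} \\
&= \frac{2\Gamma_x\Gamma_d}{\Gamma_x+\Gamma_d}\,\frac{\int_{-\infty}^{\infty} Z^2 e^{-Z^2}\,dZ}{\int_{-\infty}^{\infty} e^{-Z^2}\,dZ} + \frac{\Gamma_x^2}{(\Gamma_x+\Gamma_d)^2}\,Y^2 \\
&= \frac{\Gamma_x\Gamma_d}{\Gamma_x+\Gamma_d} + \frac{\Gamma_x^2}{(\Gamma_x+\Gamma_d)^2}\,Y^2.
\end{aligned} \tag{16}$$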
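As a sanity check on the closed form (16), the ratio of integrals in (12) can be evaluated numerically with the Gaussian densities (13) and (14) and compared against it. The sketch below uses a plain midpoint rule; the integration range, the step count and the test values are arbitrary choices made for illustration.

```python
import math


def second_moment_numeric(y, gamma_x, gamma_d, half_width=12.0, steps=20000):
    """Evaluate E{X^2 | Y} from Eq. (12) with a midpoint rule.

    The normalizing constants of Eqs. (13) and (14) cancel in the ratio,
    so only the exponentials are integrated.
    """
    limit = half_width * math.sqrt(gamma_x)
    dx = 2.0 * limit / steps
    num = den = 0.0
    for i in range(steps):
        x = -limit + (i + 0.5) * dx
        w = math.exp(-(y - x) ** 2 / (2.0 * gamma_d) - x * x / (2.0 * gamma_x))
        num += x * x * w
        den += w
    return num / den


def second_moment_closed(y, gamma_x, gamma_d):
    """Closed form of Eq. (16)."""
    s = gamma_x + gamma_d
    return gamma_x * gamma_d / s + (gamma_x / s) ** 2 * y * y


numeric = second_moment_numeric(1.5, gamma_x=2.0, gamma_d=0.5)
closed = second_moment_closed(1.5, gamma_x=2.0, gamma_d=0.5)
print(numeric, closed)
```

The two values should agree to within the accuracy of the numerical integration, which supports the reconstruction of (15) and (16) under the Gaussian assumptions.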

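To obtain (16), we have used the following relations:

$$\int_{-\infty}^{\infty} t^{q} e^{-at^{2}}\,dt = \begin{cases} 2\,I_q(a), & q = 0, 2, \ldots \\ 0, & q = 1, 3, \ldots \end{cases} \tag{17}$$

where

$$I_q(a) = 0.5\,a^{-\frac{q+1}{2}}\,\Gamma\!\left(\frac{q+1}{2}\right), \tag{18}$$

and $\Gamma(\cdot)$ is the gamma function, expressed as

$$\Gamma(q) = \int_{0}^{\infty} t^{q-1} e^{-t}\,dt. \tag{19}$$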
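As a short worked step, relations (17)-(19) can be applied with $a = 1$ to the integrals that appear in (16). The only facts used beyond the text are the standard values $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(3/2) = \sqrt{\pi}/2$:

$$\int_{-\infty}^{\infty} Z^2 e^{-Z^2}\,dZ = 2\,I_2(1) = \Gamma\!\left(\tfrac{3}{2}\right) = \frac{\sqrt{\pi}}{2}, \qquad \int_{-\infty}^{\infty} e^{-Z^2}\,dZ = 2\,I_0(1) = \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}, \qquad \int_{-\infty}^{\infty} Z\,e^{-Z^2}\,dZ = 0.$$

The ratio of the first two integrals is therefore $1/2$, which turns the coefficient $2\Gamma_x\Gamma_d/(\Gamma_x+\Gamma_d)$ into $\Gamma_x\Gamma_d/(\Gamma_x+\Gamma_d)$, while the vanishing odd integral removes the cross term in $ZY$.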

Using (7) and (10) in (16), the a priori SNR for the modified approach is given as

$$\hat{\xi}_M = \frac{\hat{X}^2}{\Gamma_d} = \frac{\hat{\xi}_{DD}}{1+\hat{\xi}_{DD}} + \left(\frac{\hat{\xi}_{DD}}{1+\hat{\xi}_{DD}}\right)^{2} \frac{Y^2}{\Gamma_d} = \hat{H}_{DD}\left(1 + \hat{H}_{DD}\,\frac{Y^2}{\Gamma_d}\right). \tag{20}$$
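The two forms of (20) can be checked against each other with a few lines of Python. The function names are hypothetical, and the term $Y^2/\Gamma_d$ is passed in as a single value `gamma`, which is an assumption of this sketch.

```python
def modified_snr_sum_form(xi_dd, gamma):
    """First form of Eq. (20): xi/(1+xi) + (xi/(1+xi))^2 * Y^2/Gamma_d."""
    ratio = xi_dd / (1.0 + xi_dd)
    return ratio + ratio ** 2 * gamma


def modified_snr_gain_form(xi_dd, gamma):
    """Second form of Eq. (20): H_DD * (1 + H_DD * Y^2/Gamma_d)."""
    h_dd = xi_dd / (1.0 + xi_dd)
    return h_dd * (1.0 + h_dd * gamma)


# Hypothetical values: xi_DD = 1 (H_DD = 0.5) and Y^2/Gamma_d = 4.
print(modified_snr_sum_form(1.0, 4.0), modified_snr_gain_form(1.0, 4.0))
```

Unlike the DD estimate alone, the result depends on the current frame through `gamma`, so a sudden rise in the noisy power is reflected in the same frame.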

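In [10], the transform applied was the DCT, and $\hat{\xi}_M$ was expressed as

$$\hat{\xi}_M = \frac{\hat{X}^2}{\Gamma_d} = \frac{2\,\Gamma(1.5)}{\sqrt{\pi}}\,\frac{\hat{\xi}_{DD}}{1+\hat{\xi}_{DD}} + \left(\frac{\hat{\xi}_{DD}}{1+\hat{\xi}_{DD}}\right)^{2} \frac{Y^2}{\Gamma_d}, \tag{21}$$

where $\Gamma(1.5)$ is the gamma function of (19) evaluated at 1.5.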

Using (8), the estimate of $\hat{\xi}_{DD}$ in (20) is given as

$$\hat{\xi}_{DD}(m,k) = \max\!\left(\alpha\,\frac{|\hat{H}_M(m-1,k)\,Y(m-1,k)|^2}{\Gamma_d(m,k)} + (1-\alpha)\,P'[\vartheta(m,k)],\ \xi_{\min}\right), \tag{22}$$

where $\hat{H}_M(m,k)$ is the gain for the modified a priori SNR estimation approach; for a Wiener filter it is given as

$$\hat{H}_M(m,k) = \frac{\hat{\xi}_M(m,k)}{1 + \hat{\xi}_M(m,k)}. \tag{23}$$
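A per-bin sketch of the modified recursion, combining (22), (20) and (23), is given below. The constant noise power, the zero initial state and the synthetic onset sequence are assumptions made for this illustration; they are not taken from the experiments.

```python
import math


def modified_track(noisy_bins, noise_psd, alpha=0.98, xi_min=0.0032):
    """Run Eqs. (22), (20) and (23) frame by frame for one frequency bin."""
    xi_m_track = []
    prev_clean_power = 0.0  # |H_M(m-1, k) Y(m-1, k)|^2, assumed zero at start
    for y in noisy_bins:
        gamma = abs(y) ** 2 / noise_psd                          # Eq. (5)
        xi_dd = max(alpha * prev_clean_power / noise_psd
                    + (1.0 - alpha) * max(gamma - 1.0, 0.0),
                    xi_min)                                      # Eq. (22)
        h_dd = xi_dd / (1.0 + xi_dd)
        xi_m = h_dd * (1.0 + h_dd * gamma)                       # Eq. (20)
        h_m = xi_m / (1.0 + xi_m)                                # Eq. (23)
        prev_clean_power = abs(h_m * y) ** 2
        xi_m_track.append(xi_m)
    return xi_m_track


# Synthetic onset: two noise-level frames, then a jump to gamma = 101.
frames = [1.0, 1.0, math.sqrt(101.0), math.sqrt(101.0)]
xi_m = modified_track(frames, noise_psd=1.0)
print([round(v, 4) for v in xi_m])
```

For this sequence the recursive term of (22) is negligible at the onset frame, so $\hat{\xi}_{DD}$ is about $0.02 \times 100 = 2$ there, whereas the modified estimate already exceeds 40 in the same frame. This is the sense in which the modified approach avoids the one-frame delay.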

4. PROPOSED SIGMOID GAIN FUNCTION

The gain $H(m,k)$ obtained using (4) emphasizes portions of the spectrum where the SNR is high and attenuates portions of the spectrum where the SNR is low. In order to apply heavy attenuation when the SNR is low and little or no attenuation (unity gain) when the SNR is high, we propose a sigmoid-shaped gain function. With a smoothing factor, the sigmoidal-shaped gain is given by

$$H_p(m,k) = \frac{2}{1 + e^{-2\mu\,\xi(m,k)}} - 1, \tag{24}$$

where $\xi(m,k)$ is the a priori SNR, $0 < \mu < 1$ is a smoothing constant, and $0 \le H_p(m,k) \le 1$ since $\xi(m,k) \ge 0$. The estimate $\hat{\xi}(m,k)$ of $\xi(m,k)$ is given by (20). Depending on the value of $\mu$, $H_p(m,k)$ reaches unity for $\hat{\xi}(m,k) \ge \mathrm{SNR}_{\max}$ and zero for $\hat{\xi}(m,k) \le \mathrm{SNR}_{\min}$. $\mathrm{SNR}_{\max}$ is inversely proportional to $\mu$, whereas $\mathrm{SNR}_{\min}$ is directly proportional to $\mu$. We have chosen $\mu = 0.25$, and for this value $\mathrm{SNR}_{\max}$ and $\mathrm{SNR}_{\min}$ are found to be 14 dB and -4 dB, respectively. Although we have used a fixed value for $\mu$, it is possible to make it adaptive to the a priori or a posteriori SNR. Fig. 1 shows the variation of the proposed gain and the Wiener gain with the a priori SNR. For an a priori SNR below -4 dB the proposed gain goes to almost zero, and for an a priori SNR above 13 dB it reaches unity gain. The values of $\mathrm{SNR}_{\max}$ and $\mathrm{SNR}_{\min}$ can be changed by changing the value of $\mu$.
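The behaviour described above can be explored with a small sketch that tabulates a sigmoid-shaped gain against the Wiener gain. It assumes that (24) has the form $H_p = 2/(1 + e^{-2\mu\xi}) - 1$ (which equals $\tanh(\mu\xi)$) and that $\xi$ enters on a linear scale; both points are assumptions of this sketch, and the function names are hypothetical.

```python
import math

MU = 0.25  # smoothing constant chosen in the text


def sigmoid_gain(xi, mu=MU):
    """Assumed form of Eq. (24): 2 / (1 + exp(-2*mu*xi)) - 1 = tanh(mu*xi)."""
    return 2.0 / (1.0 + math.exp(-2.0 * mu * xi)) - 1.0


def wiener_gain(xi):
    """Eq. (4): xi / (1 + xi)."""
    return xi / (1.0 + xi)


def db_to_linear(snr_db):
    """Convert an SNR in dB to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


for snr_db in (-10, -4, 0, 5, 10, 14, 20):
    xi = db_to_linear(snr_db)
    print(snr_db, round(sigmoid_gain(xi), 4), round(wiener_gain(xi), 4))
```

Under these assumptions the sigmoid-shaped gain lies below the Wiener gain at low a priori SNR and above it at high a priori SNR, which is the qualitative behaviour the section describes.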

Fig. 1 Spectral Gains. Proposed gain (solid line), Wiener gain (dotted line).

5. EXPERIMENTAL RESULTS

In order to compare the performance of the proposed sigmoidal-shaped gain function with that of the Wiener gain, we conducted objective quality tests under various noisy environments. The noise signals include car noise, babble noise, exhibition hall noise and white noise taken from the Aurora database. The speech signal is sampled at 8 kHz and degraded by these noises at SNRs of 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. The signals are transformed into the STFT domain using 40% overlapping Hamming windows of 256 samples (32 ms), and the signals are reconstructed using a standard overlap-add method. Fig. 1 compares the proposed gain with the Wiener filter gain.

The segmental SNR measure is adopted for the objective evaluation of the proposed gain [12], [13]. Table 1 shows the average segmental SNR (SegSNR) of the enhanced speech signals in various types of noisy environments. It is observed that the proposed approach gives a better SegSNR than the Wiener gain under all tested noisy environments. Table 2 shows the cepstral distance (CD), which is considered to be a human auditory measure, for the enhanced speech signals under various types of noise corruption. The modified a priori SNR with the Wiener gain and with the sigmoidal-shaped gain exhibits lower CD values for all noisy environments than the DD approach with the Wiener gain.

Fig. 2 shows the spectrograms of the clean, noisy and enhanced speech signals. The spectrograms in Fig. 2 use a Hanning window of 256 samples with 50% overlap, and the noisy signal contains car noise at SNR = 5 dB. It is seen that the musical noise is removed for the most part in Fig. 2 (e). Fig. 3 shows the variation of the a priori SNR estimated using the modified approach and the DD approach together with the a posteriori SNR; the noisy signal contains babble noise (SNR = 5 dB). It is observed that the modified approach solves the delay problem while maintaining the advantages of the DD approach.

In this paper we have used only objective tests for the performance evaluation of the proposed approach. In future work we will evaluate its subjective performance.
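The paper cites [12] and [13] for the segmental SNR but does not state the formula it used. The sketch below therefore follows one common definition, given here as an assumption: the frame-wise SNR in dB between the clean and the enhanced signal, limited to a fixed range and averaged over frames. The frame length, the non-overlapping framing and the limits of -10 dB and 35 dB are illustrative choices, not values from the paper.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Average frame-wise SNR in dB, with each frame clamped to [lo_db, hi_db].

    Assumed definition for illustration; frames with no clean-signal energy
    are skipped.
    """
    n = min(len(clean), len(enhanced))
    values = []
    for start in range(0, n - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        sig = sum(v * v for v in c)
        err = sum((a - b) ** 2 for a, b in zip(c, e))
        if sig == 0.0:
            continue
        snr = hi_db if err == 0.0 else 10.0 * math.log10(sig / err)
        values.append(min(max(snr, lo_db), hi_db))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)


# Hypothetical signals: a sinusoid and a copy scaled by 0.9 (10% amplitude error).
clean = [math.sin(0.1 * i) for i in range(512)]
enhanced = [0.9 * v for v in clean]
seg = segmental_snr(clean, enhanced)
print(seg)
```

A uniform 10% amplitude error corresponds to an error power 20 dB below the signal power in every frame, so the average comes out near 20 dB for this example.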

Fig. 2 Speech spectrograms. (a) Clean signal, (b) noisy signal, (c) enhanced signal using the DD approach with Wiener gain, (d) enhanced signal using the modified approach with Wiener gain, and (e) enhanced signal using the modified approach with the proposed sigmoidal gain.

Fig. 3 Variation of the a priori SNRs with the a posteriori SNR. A posteriori SNR (bold solid line with point markers), a priori SNR of the DD approach (solid line) and a priori SNR of the modified approach (dotted line).

6. CONCLUSION

We have presented a sigmoidal-shaped gain function for noise reduction with a modified a priori SNR estimator. The modified a priori SNR estimator avoids the delay problem of the DD approach, and the proposed gain function applies heavy attenuation at low SNR and little or no attenuation at high SNR, thereby improving the performance of the speech enhancement system.

Table 1 Segmental SNR (SegSNR) of enhanced speech at various noisy environments.


Table 2 Cepstral Distance (CD) of enhanced speech at various noisy environments.

REFERENCES

[1] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, pp. 1586–1604, 1979.
[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[3] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 28, pp. 137–145, 1980.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 32, pp. 1109–1121, 1984.
[5] O. Cappe, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech and Audio Processing, vol. 2, no. 1, pp. 345–349, April 1994.
[6] Cyril Plapous, Claude Marro, Laurent Mauuary, and Pascal Scalart, "A Two Step Noise Reduction Technique," IEEE Trans. Acoustic, Speech and Signal Processing, vol. 1, pp. 289–292, 2004.
[7] P. Scalart and J. Vieira Filho, "Speech Enhancement Based on a Priori Signal to Noise Estimation," IEEE Int. Conf. on Acoustics, Speech and Signal Proc., pp. 629–632, 1996.
[8] M. K. Hasan, S. Salahuddin, and M. R. Khan, "A Modified A priori SNR for Speech Enhancement Using Spectral Subtraction Rules," IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450–453, April 2004.
[9] Israel Cohen, "Speech Enhancement using a Noncausal A Priori SNR Estimator," IEEE Signal Processing Letters, vol. 11, no. 9, pp. 725–728, September 2004.
[10] Shifeng Ou, Xiaohui Zhao and Ying Gao, "Speech Enhancement Employing Modified A Priori SNR Estimation," Proc. of 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, pp. 827–831, July-August 2007.
[11] Yun-Sik Park and Joon-Hyuk Chang, "A novel Approach to a Robust A Priori SNR Estimator in Speech Enhancement," IEICE Trans. Commun., vol. E90-B, no. 8, pp. 2182–2185, August 2007.
[12] J. H. L. Hansen and B. L. Pellom, "An effective evaluation protocol for speech enhancement algorithms," in Proc. ICSLP, vol. 7, Sydney, Australia, 1998, pp. 2819–2822.
[13] Aruna Bayya and Marvin Vis, "Objective Measures for Speech Quality Assessment in Wireless Communications," Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 495–498, 1996.
