Proceedings of 2006 IEEE Information Theory Workshop (ITW'06)

DESIGN AND DESCRIPTION OF A 600 BPS SPEECH CODER BASED ON MELPE

Feng Zou, Ying Guo, Xinfu Chen, Yan Liu

Telecommunication Engineering College, Air Force Engineering University, Xi'an, Shaanxi 710077, China

Email: feng-zou@163.com

Abstract: This paper describes a 600 bps speech coder based on the enhanced mixed excitation linear prediction (MELPe) model [1]. The algorithm of this speech coder includes the features of MELPe, which obtains high-quality synthesized speech and is robust in difficult background noise environments. To reduce the bit rate, we have developed a modified multi-frame joint vector quantization that takes advantage of inherent inter-frame redundancy. A predicted multi-stage vector quantizer (PMSVQ) is designed to quantize the line spectrum frequency (LSF) parameters. Simulation results show that efficient, high-quality coding is achieved at a bit rate of 600 bps, and that the proposed coder performs better than the existing 2400 bps LPC10e standard [2].

I. INTRODUCTION

The Mixed Excitation Linear Predictive (MELP) vocoder [3] was selected as the 2400 bps Federal Standard vocoder in 1996, after the United States Department of Defense Digital Voice Processing Consortium (DDVPC) conducted an extensive multi-year testing program. MELP was selected as the best of seven candidates and even outperformed the FS1016 4800 bps vocoder. MELPe adds a 1200 bps option and speech enhancement, and it has been adopted by the North Atlantic Treaty Organization (NATO) as the STANAG 4591 2400/1200 bps vocoder. With the optional noise pre-processor, MELPe is robust in difficult background noise environments such as those frequently encountered in commercial and military communication systems.

In this paper, we describe the important aspects of the algorithm used in the proposed 600 bps speech coder. The core analysis algorithm is shared with the 2400 bps MELPe standard, and the transmitted parameters are a subset of those of the 2400 bps MELPe coder. The parameters of three consecutive frames are grouped together into a superframe, and the proposed coder quantizes the parameters of each superframe with the modified multi-frame joint vector quantization. The predicted multi-stage vector quantizer exploits the correlation of the LSF parameters from frame to frame. A moving-average (MA) prediction [4] is used to quantize the LSF parameters. The MA prediction is effective against channel errors because the propagation of decoding errors is limited by the order of the prediction. The PMSVQ achieves "transparent quality" with low computational complexity and limited error propagation.

II. CODER OVERVIEW

The proposed coder is designed to operate on an appropriately band-limited signal sampled at 8000 Hz. The input and output samples are represented using 16-bit linear PCM. The coder operates on frames of 25 ms, and three consecutive frames are grouped together into a superframe. This results in an overall algorithmic delay of 75 ms.

The analysis and synthesis algorithm of the proposed coder is shared with the 2400 bps MELPe. Six kinds of parameters are transmitted in the MELPe vocoder. We select band-pass voicing, energy, pitch, and spectrum to be quantized and transmitted; no bits are spent on the other two, the Fourier magnitudes and the aperiodic flag. The Fourier magnitude vector is quantized to one of two vectors: a flat vector for unvoiced frames and a single fixed vector for voiced frames, so the choice depends only on the voiced/unvoiced decision. The aperiodic flag can also be derived from the voiced/unvoiced decisions of the superframe, because aperiodic pulses are used most often during transition regions between voiced and unvoiced segments of the speech signal. The selected parameters are quantized jointly. The noise pre-processor, adaptive spectral enhancement, and pulse dispersion are used to obtain high-quality synthesized speech.
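
The framing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_superframes` is hypothetical, and dropping an incomplete trailing superframe is a simplification of this sketch, not something the paper specifies.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 25
FRAMES_PER_SUPERFRAME = 3

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000        # 200 samples per frame
SUPERFRAME_LEN = FRAME_LEN * FRAMES_PER_SUPERFRAME   # 600 samples per superframe


def split_superframes(samples):
    """Split PCM samples into superframes of three 25 ms frames.

    Trailing samples that do not fill a whole superframe are dropped
    (an assumption of this sketch; a real coder would keep buffering).
    """
    superframes = []
    for start in range(0, len(samples) - SUPERFRAME_LEN + 1, SUPERFRAME_LEN):
        block = samples[start:start + SUPERFRAME_LEN]
        frames = [block[k * FRAME_LEN:(k + 1) * FRAME_LEN]
                  for k in range(FRAMES_PER_SUPERFRAME)]
        superframes.append(frames)
    return superframes
```

With 8000 Hz sampling, each superframe covers 600 samples, which is the 75 ms block the coder quantizes jointly.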

III. MODIFIED MULTI-FRAME JOINT VECTOR QUANTIZATION

A. Multi-frame LSF joint quantization based on PMSVQ

The linear prediction coefficients are converted into line spectrum frequencies, and the LSF parameters of three consecutive frames are grouped together into a matrix. We can exploit the redundancy arising from the correlation between consecutive matrices, so a predicted multi-stage vector quantizer is designed to quantize the LSF parameters. The LSF residue of the prior superframe is used to predict the LSF coefficients of the current superframe. The input LSF parameters are predicted using a second-order MA prediction, and the residue of the predicted LSF parameters is quantized by a four-stage VQ [5]. Fig. 1 shows the PMSVQ scheme.

1-4244-0067-8/06/$20.00 ©2006 IEEE

Figure 1. The scheme of PMSVQ. (Block diagram; legible labels: codebook, minimum index.)

The input LSF parameters are $\boldsymbol{\omega}_{i,j}$. The quantized LSF parameters $\hat{\boldsymbol{\omega}}_{i,j}$ are generated using:

$$\hat{\boldsymbol{\omega}}_{i,j} = \mathbf{P}_{0,j}\,\hat{\mathbf{r}}_{i,j} + \mathbf{P}_{1,j}\,\hat{\mathbf{r}}_{i-1,3} + \mathbf{P}_{2,j}\,\hat{\mathbf{r}}_{i-1,2} \tag{1}$$

$$\mathbf{P}_{k,j} = \operatorname{diag}\{p_{k,j}^{1}, \ldots, p_{k,j}^{10}\} \tag{2}$$

$$\sum_{k=0}^{M} \mathbf{P}_{k,j} = \mathbf{I}; \quad j = 1, 2, 3; \quad k = 0, \ldots, M; \quad M = 2 \tag{3}$$

where $i$ indexes the superframe, $j$ indexes the frame within the superframe, $M$ is the MA prediction order, $\hat{\mathbf{r}}_{i,j}$ is the output vector of the four-stage VQ for the $j$-th frame of the $i$-th superframe, $\mathbf{I}$ is the unit matrix, and $\mathbf{P}_{k,j}$ is a diagonal prediction matrix.

The generalized Lloyd algorithm [6] is used to train the MA predictive coefficients. First, the algorithm selects, for each frame, the code from the LSF codebook that minimizes the distortion for the input LSF parameters. Second, it determines the MA predictive coefficients that minimize the distortion between the input parameters and the reconstructed parameters over all frames. The two steps are performed alternately. While the MA predictive coefficients are being trained, the codebooks are kept fixed. Spectral distortion is selected as the distortion measure.
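
The second-order MA reconstruction of Eq. (1) and the constraint of Eq. (3) can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the diagonal prediction matrices are stored as plain lists of their diagonal entries, the function names are hypothetical, and `vq_target` is simply Eq. (1) solved for the current-frame VQ output, which is one plausible way an encoder could form its quantization target.

```python
def satisfies_constraint(p0, p1, p2, tol=1e-9):
    """Eq. (3): the diagonal prediction matrices sum to the unit matrix."""
    return all(abs(a + b + c - 1.0) <= tol for a, b, c in zip(p0, p1, p2))


def reconstruct_lsf(p0, p1, p2, r_cur, r_prev3, r_prev2):
    """Eq. (1): quantized LSFs from the current VQ output and the VQ outputs
    of frames 3 and 2 of the previous superframe (diagonal P_k as lists)."""
    return [a * x + b * y + c * z
            for a, b, c, x, y, z in zip(p0, p1, p2, r_cur, r_prev3, r_prev2)]


def vq_target(lsf, p0, p1, p2, r_prev3, r_prev2):
    """Assumed encoder-side target: Eq. (1) solved for the current VQ output.

    Requires every diagonal entry of P_0 to be non-zero.
    """
    return [(w - b * y - c * z) / a
            for w, a, b, c, y, z in zip(lsf, p0, p1, p2, r_prev3, r_prev2)]
```

Because only the VQ outputs of the previous superframe enter Eq. (1), a corrupted index stops influencing the reconstruction once those outputs have left the prediction memory, which is the error-containment property the text attributes to MA prediction.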

$$SD_{i,j} = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left[10\log_{10} S_{i,j}(\omega) - 10\log_{10}\hat{S}_{i,j}(\omega)\right]^{2} d\omega} \tag{4}$$

where $i$ indexes the superframe, $j$ indexes the frame within the superframe, and $S_{i,j}(\omega)$ and $\hat{S}_{i,j}(\omega)$ are the power spectra of the unquantized and quantized signals.

The MSVQ codebook consists of four stages of 128, 128, 64, and 64 levels, respectively. The search procedure is an M-best [7] approximation to a full search, in which the M = 8 best code vectors from each stage are saved for use with the next stage; it also uses spectral distortion as the distortion measure.
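
An M-best search over a multi-stage codebook can be sketched as below. Two assumptions are made for brevity and are not taken from the paper: the squared Euclidean distance stands in for the spectral distortion measure, and codebooks are plain lists of vectors. The function name `m_best_msvq_search` is hypothetical.

```python
def m_best_msvq_search(target, codebooks, m_best=8):
    """Tree search that keeps the m_best partial reconstructions per stage.

    target    : list of floats to be quantized
    codebooks : list of stages, each a list of code vectors
    Returns (indices, reconstruction) of the best surviving path.
    """
    # Each survivor is (distortion, chosen indices, partial reconstruction).
    survivors = [(sum(t * t for t in target), [], [0.0] * len(target))]
    for stage in codebooks:
        candidates = []
        for _, indices, recon in survivors:
            for k, code in enumerate(stage):
                new_recon = [r + c for r, c in zip(recon, code)]
                dist = sum((t - v) ** 2 for t, v in zip(target, new_recon))
                candidates.append((dist, indices + [k], new_recon))
        candidates.sort(key=lambda cand: cand[0])
        survivors = candidates[:m_best]   # prune to the M best paths
    _, best_indices, best_recon = survivors[0]
    return best_indices, best_recon
```

With `m_best=1` the search degenerates to a greedy stage-by-stage search; keeping several survivors lets a path that looks poor after the first stage win once later stages are added, at a cost that grows only linearly in M.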

TABLE I. LSF QUANTIZER PERFORMANCE BASED ON PMSVQ

| Average SD (dB) | Outliers, 2 dB < SD < 4 dB | Outliers, SD > 4 dB |
| --- | --- | --- |
| 1.24 | 7.87‰ | 0.1‰ |

Table 1 shows the performance of the PMSVQ. It approximately achieves "transparent quality" while using only a 26-bit codebook. The concept of "transparent quantization" is described in [8]. Any degradation caused by channel errors affects the quality of only a few subsequent superframes, the number of which is determined by the order of the MA prediction.
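
The statistics reported in Table 1 can be computed along the following lines. This is a sketch under assumptions: the integral of the spectral distortion measure is approximated by an average over uniformly sampled power spectra, and the function names are hypothetical.

```python
import math


def spectral_distortion_db(power, power_hat):
    """Discrete approximation of the spectral distortion in dB.

    power, power_hat : power spectra of the unquantized and quantized
    signals, sampled on the same uniform frequency grid (all values > 0).
    """
    diffs = [10.0 * math.log10(s) - 10.0 * math.log10(q)
             for s, q in zip(power, power_hat)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def sd_statistics(sd_values):
    """Return (average SD, fraction with 2 < SD <= 4 dB, fraction with SD > 4 dB)."""
    n = len(sd_values)
    average = sum(sd_values) / n
    mid = sum(1 for v in sd_values if 2.0 < v <= 4.0) / n
    high = sum(1 for v in sd_values if v > 4.0) / n
    return average, mid, high
```

The two outlier fractions are usually reported alongside the average because a low mean distortion can hide a small number of badly quantized frames.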

B. Multi-frame pitch quantization

The pitch information of a superframe is quantized using a 7-bit codebook. The pitch quantization scheme depends on the voiced/unvoiced decisions of the superframe. When all frames of a superframe are unvoiced, the pitch information is not quantized. For a superframe that contains only one voiced frame, the pitch value of the voiced frame is quantized on a logarithmic scale with a 99-level uniform quantizer, the same as in the 2400 bps MELP standard. The unused bits are used for error protection. When a superframe contains two or three voiced frames, the pitch parameters are vector quantized. This VQ algorithm uses a special distortion measure, which is described in more detail in [9]. The distortion measure is as follows:

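$$d = \sum_{i=1}^{3} w_i\,\left|p_i - \hat{p}_i\right|^{2} + \beta \sum_{i=1}^{3} \left|\Delta p_i - \Delta\hat{p}_i\right|^{2} \tag{5}$$

$$w_i = \begin{cases} 1, & \text{voiced frame} \\ 0.1, & \text{unvoiced frame} \end{cases} \tag{6}$$

$$\Delta p_i = \begin{cases} p_i - p_{i-1}, & \text{voiced frames} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $p_i$ and $\hat{p}_i$ are the unquantized and quantized log pitch values, respectively, $p_0$ is the last log pitch value of the previous superframe, $w_i$ is the weighting coefficient, and $\beta$ is a parameter that controls the contribution of the pitch differentials, set to 1 in the proposed coder. The codebook index that minimizes the distortion is selected.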
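
The pitch distortion measure and the codebook search can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: the pitch differential is taken as non-zero only when both the current and the preceding frame are voiced, the quantized differential of the first frame uses the quantized last pitch of the previous superframe, and the differential-term parameter is called `beta` here. All function names are hypothetical.

```python
def pitch_distortion(p, p_hat, voiced, prev, prev_hat, prev_voiced, beta=1.0):
    """Weighted pitch distortion for one superframe (three log pitch values)."""
    weights = [1.0 if v else 0.1 for v in voiced]
    level = sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, p_hat))

    # Prepend the last frame of the previous superframe so that the
    # differential of the first frame is defined.
    full_p = [prev] + list(p)
    full_hat = [prev_hat] + list(p_hat)
    full_v = [prev_voiced] + list(voiced)

    delta = 0.0
    for i in range(1, len(full_p)):
        if full_v[i] and full_v[i - 1]:
            dp = full_p[i] - full_p[i - 1]
            dq = full_hat[i] - full_hat[i - 1]
        else:
            dp = dq = 0.0
        delta += (dp - dq) ** 2
    return level + beta * delta


def quantize_pitch(p, voiced, codebook, prev, prev_hat, prev_voiced, beta=1.0):
    """Return the index of the codebook vector with minimum distortion."""
    return min(range(len(codebook)),
               key=lambda k: pitch_distortion(p, codebook[k], voiced,
                                              prev, prev_hat, prev_voiced, beta))
```

The differential term penalizes candidates that distort the pitch contour between frames even when each individual pitch value is close to its target.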

C. Multi-frame band-pass voicing quantization

The proposed coder determines five band-pass voiced/unvoiced decisions per frame and quantizes them with a 3-bit codebook per superframe, taking advantage of the inter-frame redundancy of the voicing decisions. The band-pass voiced/unvoiced decisions of three consecutive frames are grouped together into a vector.

The VQ algorithm uses a weighted Euclidean distance as the distortion measure:

$$d = \sum_{i=1}^{3}\sum_{j=1}^{5} w_j\,\left(b_{i,j} - \hat{b}_{i,j}\right)^{2} \tag{8}$$

where $i$ indexes the frame within the current superframe, $j$ indexes the band within the frame, $b_{i,j} = 1$ if the $j$-th band-pass voiced/unvoiced decision is voiced and $b_{i,j} = 0$ otherwise, $\hat{b}_{i,j}$ is the quantized band-pass voiced/unvoiced decision, and $w_j$ is a weighting factor determined by training.
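
A nearest-neighbour search with the weighted distance of Eq. (8) might look like this. The band weights and codebook entries used with it are placeholders to be supplied by the caller; the paper obtains the weights by training and does not list them, and the function names are hypothetical.

```python
def voicing_distortion(b, b_hat, band_weights):
    """Eq. (8): weighted squared distance between two 3x5 voicing patterns.

    b, b_hat     : three frames, each a list of five 0/1 band decisions
    band_weights : one weight per band (trained in the paper, assumed here)
    """
    return sum(w * (x - y) ** 2
               for row, row_hat in zip(b, b_hat)
               for w, x, y in zip(band_weights, row, row_hat))


def quantize_voicing(b, codebook, band_weights):
    """Return the index of the closest pattern; a 3-bit codebook has 8 entries."""
    return min(range(len(codebook)),
               key=lambda k: voicing_distortion(b, codebook[k], band_weights))
```

Weighting by band lets errors in perceptually more important bands cost more than errors in the others, so the 8 available patterns are matched where it matters most.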

D. Multi-frame gain quantization

Two gain parameters are calculated per frame, and the logarithmic energy values from three successive frames are grouped to form 6-dimensional vectors. An 8-bit codebook is used to quantize the vectors. The gain codebook was generated using the K-means vector quantization algorithm, with the Euclidean distance as the distortion measure.
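
A bare-bones K-means codebook design with a Euclidean nearest-neighbour rule is sketched below. It is illustrative only: the initialization from the first training vectors, the fixed iteration count, and the handling of empty cells are assumptions of this sketch, and the paper does not describe its training setup. In the coder the vectors would have 6 dimensions and the codebook 256 entries.

```python
def nearest(vec, codebook):
    """Index of the code vector closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])))


def kmeans_codebook(training, size, iterations=20):
    """Design a codebook of `size` vectors from the training vectors."""
    # Naive initialization: the first `size` training vectors.
    codebook = [list(v) for v in training[:size]]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in training:
            cells[nearest(v, codebook)].append(v)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous centroid
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook
```

Quantizing a superframe's gain vector then amounts to calling `nearest` on the trained codebook and transmitting the returned index.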

E. Bit allocation

The proposed coder operates on frames of 25 ms and buffers blocks of three consecutive frames, for a block duration of 75 ms. The bit allocation of the proposed coder is shown in Table 2. A total of 45 bits is used per superframe, which corresponds to 45 bits / 75 ms = 600 bps.

TABLE II. THE PROPOSED CODER BIT ALLOCATION

| Parameters | Bits per superframe |
| --- | --- |
| Pitch | 7 |
| Gain | 8 |
| Fourier Magnitudes | 0 |
| Band-pass Voicing | 3 |
| Aperiodic Flag | 0 |
| LSF | 26 |
| Synchronization | 1 |
| Total | 45 |
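
As a quick consistency check on the numbers in Table 2, the bit budget and the resulting bit rate can be recomputed. The variable names are illustrative; the figures are those given in the paper.

```python
BIT_ALLOCATION = {
    "Pitch": 7,
    "Gain": 8,
    "Fourier magnitudes": 0,
    "Band-pass voicing": 3,
    "Aperiodic flag": 0,
    "LSF": 26,
    "Synchronization": 1,
}
SUPERFRAME_MS = 3 * 25  # three 25 ms frames

total_bits = sum(BIT_ALLOCATION.values())
bit_rate_bps = total_bits * 1000 // SUPERFRAME_MS

# The four MSVQ stages of 128, 128, 64 and 64 levels need 7 + 7 + 6 + 6 bits.
lsf_bits = sum(size.bit_length() - 1 for size in (128, 128, 64, 64))
```

The 26 LSF bits match the four stage sizes of the MSVQ, and 45 bits every 75 ms gives exactly 600 bps.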

IV. TEST RESULTS

The Diagnostic Rhyme Test (DRT) and the Diagnostic Acceptability Measure (DAM) are used in these informal tests. The DRT measures speech intelligibility, and the DAM measures speech quality. For comparison purposes, the 2.4 kbps MELPe standard coder [1] was used. The coders were tested on speech with a quiet background, over a channel with 1% random bit errors, and with a high mobility multipurpose wheeled vehicle (HMMWV) background. All of the coders scored higher for male talkers than for female talkers, and the averaged male and female scores are shown in Tables 3 to 5. The subjective quality of the proposed coder is found to be better than that of LPC10e [2] and close to that of the 2.4 kbps MELP standard [3] in informal tests.

TABLE III. INFORMAL TEST RESULTS IN QUIET BACKGROUND

| Test item | DRT | DAM |
| --- | --- | --- |
| Test Speech Signal (Quiet) | 96.1 | 86.0 |
| 2400 bps MELPe | 94.2 | 70.1 |
| 600 bps proposed coder | 91.3 | 56.7 |

TABLE IV. INFORMAL TEST RESULTS IN 1% BER

| Test item | DRT | DAM |
| --- | --- | --- |
| Test Speech Signal (Quiet) | 96.1 | 86.0 |
| 2400 bps MELPe | 92.2 | 59.5 |
| 600 bps proposed coder | 86.1 | 45.7 |

TABLE V. INFORMAL TEST RESULTS IN HMMWV BACKGROUND

| Test item | DRT | DAM |
| --- | --- | --- |
| Test Speech Signal (HMMWV) | 92.0 | 50.3 |
| 2400 bps MELPe | 76.2 | 54.6 |
| 600 bps proposed coder | 68.4 | 42.5 |

V. CONCLUSION

In this paper, a new 600 bps speech coder based on MELPe is proposed, and the important aspects of the algorithm are described. The proposed coder uses new techniques to improve performance. The PMSVQ is designed to quantize the LSF parameters; it approximately achieves "transparent quality" and is effective against channel errors. By using the modified multi-frame joint vector quantization to quantize the parameters of each superframe, we reduce the bit rate and obtain high-quality synthesized speech. The informal subjective quality tests show that the speech quality of the proposed coder is better than that of LPC10e and close to that of the 2.4 kbps MELP standard.

ACKNOWLEDGMENTS

The research is supported by the Shaanxi Natural Science Foundation of China (No. 2006F40).

REFERENCES

[1] J. S. Collura and D. F. Brandt, "The 1.2 kbps/2.4 kbps MELP speech coding suite with integrated noise pre-processing," in Proc. IEEE Mil. Comm., Atlantic City, NJ, vol. 2, pp. 1449-1453, Oct.-Nov. 1999.

[2] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technology, vol. 2, no. 1, pp. 40-49, April 1982.

[3] A. McCree and K. Truong, "A 2.4 kbit/s MELP coder candidate for the new U.S. federal standard," Proceedings of IEEE ICASSP 1996, Piscataway, New Jersey, IEEE Press, pp. 200-203, 1996.

[4] R. Salami and C. Laflamme, "Design and description of CS-ACELP: A toll quality 8 kb/s speech coder," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 116-130, March 1998.

[5] W. Y. Chan and S. Gupta, "Enhanced multistage vector quantization by joint codebook design," IEEE Transactions on Communications, vol. 40, no. 11, pp. 1693-1697, 1992.

[6] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 129-137, 1982.

[7] W. P. LeBlanc and B. Bhattacharya, "Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 373-385, 1993.

[8] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 3-14, Jan. 1993.

[9] T. Wang and K. Koishida, "A 1200 bps speech coder based on MELP," IEEE ICASSP 2000, Piscataway, New Jersey, IEEE Press, pp. 1375-1378, 2000.
