
GAIN NORMALIZATION IN A 4200 BPS HOMOMORPHIC VOCODER

Jae H. Chung and Ronald W. Schafer

Georgia Institute of Technology, School of Electrical Engineering

Atlanta, GA 30332

    Abstract

This paper describes a new technique for coding the gains in a vector excitation homomorphic vocoder. In this system, the excitation signal, which is obtained by analysis-by-synthesis, consists of a part derived from a Gaussian codebook and a part derived from the past excitation. The paper shows how the correlation between the two gain parameters of the excitation can be increased and how they can be jointly coded at a lower bit rate. This new approach makes it possible to reduce the bit rate of the homomorphic vocoder from 4800 bps to 4200 bps with essentially no degradation in speech quality.

    1 Introduction

In the original definition of the homomorphic vocoder, an estimate of the time-varying vocal tract impulse response was extracted using the homomorphic filtering procedure depicted in Figure 1.[1] The upper part of the figure depicts the operations required to compute the cepstrum ĥ[n] of the vocal tract impulse response.[1][2] In Figure 1, v[n] is a window sequence (e.g., Hamming window) which selects a short segment of the speech signal for analysis, and l[n] is a "lifter" of the form

\[ l[n] = \begin{cases} 2, & 1 \le n < n_0 \quad (n_0 = \text{pitch period}) \\ 0, & \text{otherwise} \end{cases} \tag{1} \]

which extracts the low-time part of the cepstrum as a representation of the vocal tract impulse response. The lower part of Figure 1 depicts the operations for computing the normalized (since ĥ[0] = 0) vocal tract impulse response h[n]. The original homomorphic vocoder also used the cepstrum as the basis for a voiced/unvoiced decision and to estimate the pitch period for voiced speech.[1] At the synthesizer, depicted in Figure 2, an excitation sequence consisting of isolated impulses or random noise was created, and this input was convolved with the estimated vocal tract impulse response to produce the synthetic speech output. The pitch period, amplitude of the excitation, and the low-time cepstrum values comprise a parametric representation of the speech signal that can be encoded for digital transmission or storage.
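As a minimal numerical sketch of the liftering operation in (1), the following uses a real-cepstrum, minimum-phase variant on a synthetic signal (the filter coefficients, FFT size, and assumed pitch period here are illustrative, not taken from the paper; the original system uses the full homomorphic analysis of [1][2]):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024                         # FFT size for cepstral analysis
n0 = 64                          # assumed pitch period, in samples

# Synthetic "speech" segment: noise through a simple resonator, windowed
s = np.convolve(rng.standard_normal(256), [1.0, 0.9, 0.6])[:256]
x = s * np.hamming(256)          # v[n]: analysis window

# Real cepstrum of the windowed segment
spectrum = np.fft.rfft(x, N)
c = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12), N)

# Lifter of eq (1): l[n] = 2 for 1 <= n < n0, 0 otherwise. Doubling the
# low-time real cepstrum yields the cepstrum of the minimum-phase
# equivalent, and l[0] = 0 normalizes the result so that h[0] = 1.
l = np.zeros(N)
l[1:n0] = 2.0
h_hat = l * c                    # low-time cepstrum of impulse response

# Back out the (normalized) vocal tract impulse response
h = np.fft.ifft(np.exp(np.fft.fft(h_hat))).real
```

Because ĥ[0] is forced to zero, the recovered impulse response comes out normalized with h[0] ≈ 1, which is the normalization the paper relies on.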

The availability of increasingly powerful, inexpensive DSP microcomputers has made it possible to consider much more sophisticated methods for obtaining the excitation signal in vocoders. Multipulse[3], code-excited[4], and self-excited or vector excitation[5] LPC vocoders have been widely studied. These same analysis-by-synthesis methods have also been applied successfully to derive the excitation for a homomorphic vocoder at a bit rate of 4800 bps.[6] The performance of this 4800 bps vector-excited homomorphic vocoder is far superior to that of a pitch-excited homomorphic vocoder and fully comparable to 4800 bps LPC vocoders using analysis-by-synthesis vector excitation.

This paper describes a new method of coding the two gain parameters of a vector excitation homomorphic vocoder. The approach involves a time-varying gain normalization, which transforms the original uncorrelated gain parameters into highly correlated parameters that can be jointly quantized to achieve a significant reduction in bit rate over independent quantization of the two gain parameters. Using this technique, the bit rate of the homomorphic vocoder can be reduced to 4200 bps with little or no degradation when compared to the 4800 bps vocoder with independently quantized gains.

The paper is organized as follows: Section 2 gives a brief review of the analysis-by-synthesis method of obtaining the excitation signal; Section 3 introduces the gain normalization procedure; Section 4 describes a simple procedure for jointly quantizing the two gain parameters of the vector excitation, thereby reducing the bit rate from 4800 bps to 4200 bps; and Section 5 briefly summarizes some conclusions from the research.

322.4.1 CH2829-0/90/0000-0942 $1.00 © 1990 IEEE

    2 The Excitation Model

Figure 3 shows a block diagram representation of the analysis-by-synthesis algorithm for determining the excitation signal e[n] for the homomorphic vocoder. The excitation model for a short excitation analysis frame (e.g., 5 msec, or 40 samples at the 8 kHz sampling rate) is of the form

\[ e[n] = \beta_1 f_{\gamma_1}[n] + \beta_2 e[n-\gamma_2] \tag{2} \]

and the corresponding perceptually weighted synthetic speech is

\[ \hat{y}[n] = \beta_1 x_1[n] + \beta_2 x_2[n] \tag{3} \]

where x1[n] = g[n] * f_γ1[n] and x2[n] = g[n] * e[n − γ2], and g[n] = w[n] * h[n] is the perceptually weighted vocal tract impulse response. The excitation signal is composed of the following two parts: β1 f_γ1[n], where f_γ1[n] is a zero-mean Gaussian codebook sequence corresponding to index γ1 in the codebook, and β2 e[n − γ2], which represents a short segment of the past (previously computed) excitation beginning γ2 samples before the present excitation frame. Henceforth, β1 will be called the codebook gain and β2 will be called the self-excitation gain.
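The two-part excitation model and the weighted synthesis described above can be sketched in a few lines (NumPy; the frame length follows the text, but the gains, lag, and stand-in impulse response are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 40                                   # excitation frame (5 ms at 8 kHz)

past_e = rng.standard_normal(160)        # previously computed excitation
f_gamma1 = rng.standard_normal(L)        # Gaussian codebook entry, index gamma1
gamma2 = 64                              # lag into the past excitation
beta1, beta2 = 0.8, 0.9                  # codebook and self-excitation gains

# Excitation: codebook part plus a segment of the past excitation
start = len(past_e) - gamma2
e = beta1 * f_gamma1 + beta2 * past_e[start:start + L]

# Weighted synthesis: convolve each part with g[n] = w[n]*h[n]
g = rng.standard_normal(16)              # stand-in weighted impulse response
x1 = np.convolve(g, f_gamma1)[:L]
x2 = np.convolve(g, past_e[start:start + L])[:L]
y_hat = beta1 * x1 + beta2 * x2
```

By linearity of convolution, filtering the combined excitation e[n] through g[n] gives the same weighted synthetic speech as summing the two filtered parts.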

First, the parameters γ2 and β2 are chosen to minimize the mean-squared error

\[ E_2 = \sum_n \big( y[n] - \beta_2 x_2[n] \big)^2 \tag{4} \]

where y[n] is the perceptually weighted input speech. For a given γ2, the value of β2 that minimizes the mean-squared error in (4) is given by

\[ \beta_2 = \frac{\sum_n y[n]\, x_2[n]}{\sum_n x_2^2[n]} \tag{5} \]

where x2[n] = g[n] * e[n − γ2] = w[n] * h[n] * e[n − γ2]. The optimum values for γ2 and β2 are found by an exhaustive search with values of γ2 restricted to a finite range. Then the residual signal y1[n] = y[n] − β2 x2[n] is formed, and γ1 and β1 are chosen by an exhaustive search of the Gaussian codebook to minimize

\[ E_1 = \sum_n \big( y_1[n] - \beta_1 x_1[n] \big)^2 \tag{6} \]

As before, the value of β1 that minimizes the mean-squared error in (6) for a given codebook sequence f_γ1[n] is

\[ \beta_1 = \frac{\sum_n y_1[n]\, x_1[n]}{\sum_n x_1^2[n]} \tag{7} \]

where x1[n] = g[n] * f_γ1[n] = w[n] * h[n] * f_γ1[n].

Note that the effect of convolving w[n] with the original speech and with the impulse response before synthesis is effectively to multiply the Fourier transform of the error between the original speech and the synthetic speech by the magnitude squared of the frequency response corresponding to w[n]. When w[n] is properly chosen, the weighting has the effect of concentrating the coding noise in the formant regions of the spectral envelope, thereby making the coding error less perceptible.[7] In the homomorphic vocoder, an appropriate weighting filter can be derived directly from the cepstrum.[6]
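The two-stage search described above, with its closed-form optimal gains, can be sketched as follows (NumPy; the target, past excitation, and codebook are random stand-ins, and lags shorter than one frame, which the paper's 32-160 range allows, would need recursive overlap handling that is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 40                                   # excitation frame (5 ms at 8 kHz)
g = rng.standard_normal(16)              # weighted impulse response g = w*h
y = rng.standard_normal(L)               # weighted target speech frame
past_e = rng.standard_normal(160)        # previously computed excitation
codebook = rng.standard_normal((128, L)) # zero-mean Gaussian codebook

def best_gain(target, x):
    # Closed-form optimal gain: beta = sum(target*x) / sum(x^2)
    return np.dot(target, x) / np.dot(x, x)

# Stage 1: exhaustive search over the self-excitation lag gamma2,
# with the gain beta2 computed in closed form for each candidate lag.
err2, gamma2, beta2 = np.inf, None, None
for lag in range(L, 161):
    start = len(past_e) - lag
    x2 = np.convolve(g, past_e[start:start + L])[:L]
    b = best_gain(y, x2)
    err = np.sum((y - b * x2) ** 2)
    if err < err2:
        err2, gamma2, beta2 = err, lag, b

start = len(past_e) - gamma2
x2 = np.convolve(g, past_e[start:start + L])[:L]
y1 = y - beta2 * x2                      # residual target for stage 2

# Stage 2: exhaustive Gaussian-codebook search, beta1 in closed form.
err1, gamma1, beta1 = np.inf, None, None
for idx, f in enumerate(codebook):
    x1 = np.convolve(g, f)[:L]
    b = best_gain(y1, x1)
    err = np.sum((y1 - b * x1) ** 2)
    if err < err1:
        err1, gamma1, beta1 = err, idx, b
```

With the optimal gain, each stage's error can never exceed the energy of its target, so the search monotonically improves on a zero-gain excitation.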

    3 Gain Normalization

Figures 4(a) and 4(b) show |β1| and |β2| as functions of the frame index. In this example, the system of Figure 1 was used to compute 16 low-time cepstrum values with a vocal tract analysis frame spacing of 20 msec (160 samples). The cepstrum values were vector quantized using a codebook of 256 entries (8 bits/cepstrum).[6] The system of Figure 3 was used to obtain the excitation signal using a 40-dimensional Gaussian codebook with 128 entries (γ1 represented by 7 bits) and with an excitation signal memory (γ2) ranging from 32 to 160 samples. The excitation frame length and spacing were both 5 msec (40 samples).

The behaviors of |β1| and |β2| with time are distinctly different, but the difference is easily understood. First note from (1) that the impulse responses h[n] are all normalized automatically because ĥ[0] = 0. Then recall that β2 is the constant multiplier of a portion of the previously computed excitation that contributes to the excitation in the current frame. Notice in Figure 4(b) that |β2| remains fairly constant near unity, except for large spikes and abrupt dips toward zero. This is to be expected, since in steady-state regions the amplitude in an excitation analysis frame should be about the same as the amplitude in previous frames in that steady-state region. However, in a transition frame from voiced to unvoiced, past excitation amplitudes will be much larger than required, and therefore |β2| will have to be small to compensate. Likewise, in transitions from unvoiced to voiced, the immediate past excitation will be small, while a larger excitation will be required in the current frame. Therefore |β2| must be large to compensate.

In contrast, |β1| tends to track the energy envelope of the speech signal and is somewhat better behaved. Clearly, the amplitude of the residual signal y1[n] will be proportional to the amplitude of the original speech signal. Therefore, since the codebook sequences all have the same energy, |β1| will track the amplitude of y1[n] and therefore also the amplitude of the input speech signal.

Figure 4 shows that |β1| and |β2| are not highly correlated, and therefore it would seem that there is little to be gained by jointly quantizing them. However, recall that |β2| is generally close to unity, while |β1| tends to follow the amplitude of the speech signal and the excitation signal. This suggests that if |β2| is normalized by a function of the previous excitation energy, then the correlation with |β1| can be greatly increased. Indeed, Figure 5(a) shows the parameter α|β2|, where

\[ \alpha = \left[ \left( \sum_{n=0}^{L-1} e^2[n-\gamma_2] \right) \left( \sum_{n=0}^{L-1} e^2[n-L] \right) \right]^{1/2} \tag{8} \]

with L representing the excitation frame length. That is, the gain normalizing factor α is the geometric mean of the energy of the excitation segment beginning at γ2 and the energy of the just-previous excitation frame. This averaging gives a smoothly varying normalizing factor which, as can be seen from Figure 5(a), converts |β2| into a parameter that varies with time in much the same way that |β1| varies.
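The normalizing factor α, and its inversion at a receiver that shares the same past excitation, can be sketched as follows (NumPy; the excitation history, lag, and gain value are illustrative stand-ins, and quantization of the transmitted parameter is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
L = 40                                   # excitation frame length
past_e = rng.standard_normal(200)        # excitation history
gamma2, beta2 = 64, 0.95                 # hypothetical lag and gain

def alpha(past_e, gamma2, L):
    # Geometric mean of the energy of the excitation segment beginning
    # gamma2 samples back and the energy of the just-previous frame.
    start = len(past_e) - gamma2
    e_seg = past_e[start:start + L]      # segment at lag gamma2
    e_prev = past_e[-L:]                 # just-previous excitation frame
    return np.sqrt(np.sum(e_seg ** 2) * np.sum(e_prev ** 2))

a = alpha(past_e, gamma2, L)
transmitted = a * abs(beta2)             # the normalized gain alpha*|beta2|

# Receiver side: alpha is recomputed from the shared past excitation,
# so |beta2| is recovered exactly (before any quantization error).
beta2_rx = transmitted / alpha(past_e, gamma2, L)
```

Because both ends compute α from the same decoded past excitation, no extra side information is needed to undo the normalization.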

Figure 6 shows the correlation between |β1| and α|β2| more clearly. Indeed, Figure 6 implies that |β1| is proportional to α|β2| to within a constant maximum percentage error. The straight line in Figure 6 is a least-squares fit to the data, which include 1536 frames from four different utterances by four different speakers. This linear fit to the log-log data is given by (9), which serves as an approximate relationship between the codebook gain and the normalized self-excitation gain α|β2|.

4 Quantization for 4200 bps

                 Bits/Frame            Samples/Frame       Bit Rate
    cepstrum  β1  β2  γ1  γ2     excitation  cepstrum     (bits/sec)
        8      4   4   7   7         40         160          4800
        8      1   4   7   7         40         160          4200

Table 1 Bit allocation for homomorphic vocoders.

The 4800 bps homomorphic vocoder that was previously reported used the bit allocation scheme given in the first row of Table 1. In the 4800 bps vocoder, each of the two gain parameters was coded using a 3-bit APCM coder[8] (plus a sign bit) to code |β1| and |β2|. The results of the previous section suggest that the total bit allocation can be reduced by jointly coding |β1| and α|β2|. Clearly, many schemes can be found to take advantage of the correlation illustrated in Figure 6. One approach is simply to code α|β2| using 3-bit APCM. (Figure 5(b) shows an example of this 3-bit quantization.) This information, together with one bit each for the signs of β1 and β2, completes the representation, for a total of five bits instead of eight. With an excitation frame rate of 200 frames/sec, this results in a reduction of 600 bps.

At the receiver, |β2| is derived from the quantized version of α|β2| by dividing by α, which is derivable using (8) from the past excitation. Then |β1| is obtained from the derived |β2| through (9). As an illustration, Figures 7(a) and 7(b) show the decoded |β1| and |β2|, respectively, for the corresponding parameters in Figures 4(a) and 4(b).

The performance of the 4200 bps homomorphic vocoder is virtually identical to that of the 4800 bps version. This is confirmed by careful listening tests and by the fact that, over a range of speakers and utterances, the signal-to-noise ratio decreases by less than 0.4 dB in going from 4800 to 4200 bps.

5 Conclusions

We have discussed a scheme for coding the codebook gain and the self-excitation gain in a vector-excitation homomorphic vocoder. The proposed method permits a significant reduction of bit rate for the homomorphic vocoder with virtually no loss of quality. Furthermore, the method is applicable to any vocoder using the analysis-by-synthesis method of excitation analysis. Future work will consider other variations on the normalization scheme as well as other methods of jointly quantizing the two gain parameters.
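As a numerical check on the bit allocations of Table 1, the frame rates follow from the 8 kHz sampling rate (40-sample excitation frames give 200 frames/sec; 160-sample cepstrum frames give 50 frames/sec):

```python
# Bit-rate bookkeeping for the two vocoders of Table 1.
EXC_RATE = 8000 // 40      # 40-sample excitation frames -> 200 frames/sec
CEP_RATE = 8000 // 160     # 160-sample cepstrum frames  ->  50 frames/sec

# 4800 bps: 8-bit cepstrum VQ; 4 bits per gain (3-bit APCM + sign);
# 7 bits each for the indices gamma1 and gamma2.
rate_4800 = CEP_RATE * 8 + EXC_RATE * (4 + 4 + 7 + 7)

# 4200 bps: gains jointly coded as 3-bit APCM for alpha*|beta2| plus
# one sign bit each for beta1 and beta2 (5 bits total).
rate_4200 = CEP_RATE * 8 + EXC_RATE * (5 + 7 + 7)

print(rate_4800, rate_4200, rate_4800 - rate_4200)   # 4800 4200 600
```

The difference of 600 bps comes entirely from the three excitation-gain bits saved per 5 msec frame.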


    References


[1] A. V. Oppenheim, "Speech analysis-synthesis system based on homomorphic filtering," J. Acoust. Soc. Am., vol. 45, pp. 458-465, Feb. 1969.

[2] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. on Audio and Electroacoustics, pp. 118-123, June 1968.

[3] B. Atal and J. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 614-617, 1982.

[4] M. R. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high-quality speech at very low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 937-940, 1985.

[5] R. Rose and T. P. Barnwell, III, "Quality comparison of low complexity 4800 bps self excited and code excited vocoders," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 1637-1640, 1987.

[6] J. H. Chung and R. W. Schafer, "A 4.8 kbps homomorphic vocoder using analysis-by-synthesis excitation analysis," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 144-147, 1989.

[7] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 3, pp. 247-254, June 1979.

[8] N. S. Jayant, "Adaptive quantization with a one word memory," Bell System Tech. J., pp. 1119-1144, September 1973.

    Figure 1 Homomorphic filtering for estimating the vocal tract impulse response.


    Figure 2 Synthesizer for homomorphic vocoder.


Figure 3 Analysis-by-synthesis method for obtaining the excitation sequence e[n].


Figure 4 (a) Codebook gain |β1|. (b) Self-excitation gain |β2|. (Horizontal axis: excitation frame index.)

Figure 5 (a) Unquantized normalized self-excitation gain α|β2|. (b) Quantized normalized self-excitation gain. (Horizontal axis: excitation frame index.)

Figure 6 Illustration of correlation between |β1| and α|β2|.

Figure 7 (a) Quantized codebook gain. (b) Quantized self-excitation gain.