
GAIN NORMALIZATION IN A 4200 BPS HOMOMORPHIC VOCODER

Jae H. Chung and Ronald W. Schafer

Georgia Institute of Technology, School of Electrical Engineering

Atlanta, GA 30332

    Abstract

This paper describes a new technique for coding the gains in a vector excitation homomorphic vocoder. In this system, the excitation signal, which is obtained by analysis-by-synthesis, consists of a part derived from a Gaussian codebook and a part derived from the past excitation. The paper shows how the correlation between the two gain parameters of the excitation can be increased and how they can be jointly coded at a lower bit rate. This new approach makes it possible to reduce the bit rate of the homomorphic vocoder from 4800 bps to 4200 bps with essentially no degradation in speech quality.

    1 Introduction

In the original definition of the homomorphic vocoder, an estimate of the time-varying vocal tract impulse response was extracted using the homomorphic filtering procedure depicted in Figure 1.[1] The upper part of the figure depicts the operations required to compute the cepstrum ĥ[n] of the vocal tract impulse response.[1][2] In Figure 1, v[n] is a window sequence (e.g., Hamming window) which selects a short segment of the speech signal for analysis, and l[n] is a "lifter" of the form

\[ l[n] = \begin{cases} 2, & 1 \le n < n_0 \quad (n_0 = \text{pitch period}) \\ 0, & \text{otherwise} \end{cases} \tag{1} \]

which extracts the low-time part of the cepstrum as a representation of the vocal tract impulse response. The lower part of Figure 1 depicts the operations for computing the normalized (since ĥ[0] = 0) vocal tract impulse response h[n]. The original homomorphic vocoder also used the cepstrum as the basis for a voiced/unvoiced decision and to estimate the pitch period for voiced speech.[1] At the synthesizer, depicted in Figure 2, an excitation sequence consisting of isolated impulses or random noise was created, and this input was convolved with the estimated vocal tract impulse response to produce the synthetic speech output. The pitch period, amplitude of the excitation, and the low-time cepstrum values comprise a parametric representation of the speech signal that can be encoded for digital transmission or storage.
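As a minimal numerical sketch of the liftering operation in (1), the following uses a real-cepstrum, minimum-phase variant on a synthetic signal (the filter coefficients, FFT size, and assumed pitch period here are illustrative, not taken from the paper; the original system uses the full homomorphic analysis of [1][2]):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024                         # FFT size for cepstral analysis
n0 = 64                          # assumed pitch period, in samples

# Synthetic "speech" segment: noise through a simple resonator, windowed
s = np.convolve(rng.standard_normal(256), [1.0, 0.9, 0.6])[:256]
x = s * np.hamming(256)          # v[n]: analysis window

# Real cepstrum of the windowed segment
spectrum = np.fft.rfft(x, N)
c = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12), N)

# Lifter of eq (1): l[n] = 2 for 1 <= n < n0, 0 otherwise. Doubling the
# low-time real cepstrum yields the cepstrum of the minimum-phase
# equivalent, and l[0] = 0 normalizes the result so that h[0] = 1.
l = np.zeros(N)
l[1:n0] = 2.0
h_hat = l * c                    # low-time cepstrum of impulse response

# Back out the (normalized) vocal tract impulse response
h = np.fft.ifft(np.exp(np.fft.fft(h_hat))).real
```

Because ĥ[0] is forced to zero, the recovered impulse response comes out normalized with h[0] ≈ 1, which is the normalization the paper relies on.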

The availability of increasingly powerful, inexpensive DSP microcomputers has made it possible to consider much more sophisticated methods for obtaining the excitation signal in vocoders. Multipulse[3], code-excited[4], and self-excited or vector excitation[5] LPC vocoders have been widely studied. These same analysis-by-synthesis methods have also been applied successfully to derive the excitation for a homomorphic vocoder at a bit rate of 4800 bps.[6] The performance of this 4800 bps vector-excited homomorphic vocoder is far superior to that of a pitch-excited homomorphic vocoder and fully comparable to 4800 bps LPC vocoders using analysis-by-synthesis vector excitation.

This paper describes a new method of coding the two gain parameters of a vector excitation homomorphic vocoder. The approach involves a time-varying gain normalization, which transforms the original uncorrelated gain parameters into highly correlated parameters that can be jointly quantized to achieve a significant reduction in bit rate over independent quantization of the two gain parameters. Using this technique, the bit rate of the homomorphic vocoder can be reduced to 4200 bps with little or no degradation when compared to the 4800 bps vocoder with independently quantized gains.

The paper is organized as follows: Section 2 gives a brief review of the analysis-by-synthesis method of obtaining the excitation signal; Section 3 introduces the gain normalization procedure; Section 4 describes a simple procedure for jointly quantizing the two gain parameters of the vector excitation, thereby reducing the bit rate from 4800 bps to 4200 bps; and Section 5 briefly summarizes some conclusions from the research.

322.4.1 CH2829-0/90/0000-0942 $1.00 © 1990 IEEE

    2 The Excitation Model

Figure 3 shows a block diagram representation of the analysis-by-synthesis algorithm for determining the excitation signal e[n] for the homomorphic vocoder. The excitation model for a short excitation analysis frame (e.g., 5 msec, or 40 samples at the 8 kHz sampling rate) is of the form

\[ e[n] = \beta_1 f_{\gamma_1}[n] + \beta_2 e[n-\gamma_2] \tag{2} \]

and the corresponding perceptually weighted synthetic speech is

\[ \hat{y}[n] = \beta_1 x_1[n] + \beta_2 x_2[n] \tag{3} \]

where x1[n] = g[n] * f_γ1[n] and x2[n] = g[n] * e[n − γ2], and g[n] = w[n] * h[n] is the perceptually weighted vocal tract impulse response. The excitation signal is composed of the following two parts: β1 f_γ1[n], where f_γ1[n] is a zero-mean Gaussian codebook sequence corresponding to index γ1 in the codebook, and β2 e[n − γ2], which represents a short segment of the past (previously computed) excitation beginning γ2 samples before the present excitation frame. Henceforth, β1 will be called the codebook gain and β2 will be called the self-excitation gain.
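The two-part excitation model and the weighted synthesis described above can be sketched in a few lines (NumPy; the frame length follows the text, but the gains, lag, and stand-in impulse response are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 40                                   # excitation frame (5 ms at 8 kHz)

past_e = rng.standard_normal(160)        # previously computed excitation
f_gamma1 = rng.standard_normal(L)        # Gaussian codebook entry, index gamma1
gamma2 = 64                              # lag into the past excitation
beta1, beta2 = 0.8, 0.9                  # codebook and self-excitation gains

# Excitation: codebook part plus a segment of the past excitation
start = len(past_e) - gamma2
e = beta1 * f_gamma1 + beta2 * past_e[start:start + L]

# Weighted synthesis: convolve each part with g[n] = w[n]*h[n]
g = rng.standard_normal(16)              # stand-in weighted impulse response
x1 = np.convolve(g, f_gamma1)[:L]
x2 = np.convolve(g, past_e[start:start + L])[:L]
y_hat = beta1 * x1 + beta2 * x2
```

By linearity of convolution, filtering the combined excitation e[n] through g[n] gives the same weighted synthetic speech as summing the two filtered parts.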

First, the parameters γ2 and β2 are chosen to minimize the mean-squared error

\[ E_2 = \sum_n \big( y[n] - \beta_2 x_2[n] \big)^2 \tag{4} \]

where y[n] is the perceptually weighted input speech. For a given γ2, the value of β2 that minimizes the mean-squared error in (4) is given by

\[ \beta_2 = \frac{\sum_n y[n]\, x_2[n]}{\sum_n x_2^2[n]} \tag{5} \]

where x2[n] = g[n] * e[n − γ2] = w[n] * h[n] * e[n − γ2]. The optimum values for γ2 and β2 are found by an exhaustive search with values of γ2 restricted to a finite range. Then the residual signal y1[n] = y[n] − β2 x2[n] is formed, and γ1 and β1 are chosen by an exhaustive search of the Gaussian codebook to minimize

\[ E_1 = \sum_n \big( y_1[n] - \beta_1 x_1[n] \big)^2 \tag{6} \]

As before, the value of β1 that minimizes the mean-squared error in (6) for a given codebook sequence f_γ1[n] is

\[ \beta_1 = \frac{\sum_n y_1[n]\, x_1[n]}{\sum_n x_1^2[n]} \tag{7} \]

where x1[n] = g[n] * f_γ1[n] = w[n] * h[n] * f_γ1[n].

Note that the effect of convolving w[n] with the original speech and with the impulse response before synthesis is effectively to multiply the Fourier transform of the error between the original speech and the synthetic speech by the magnitude squared of the frequency response corresponding to w[n]. When w[n] is properly chosen, the weighting has the effect of concentrating the coding noise in the formant regions of the spectral envelope, thereby making the coding error less perceptible.[7] In the homomorphic vocoder, an appropriate weighting filter can be derived directly from the cepstrum.[6]
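The two-stage search described above, with its closed-form optimal gains, can be sketched as follows (NumPy; the target, past excitation, and codebook are random stand-ins, and lags shorter than one frame, which the paper's 32-160 range allows, would need recursive overlap handling that is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 40                                   # excitation frame (5 ms at 8 kHz)
g = rng.standard_normal(16)              # weighted impulse response g = w*h
y = rng.standard_normal(L)               # weighted target speech frame
past_e = rng.standard_normal(160)        # previously computed excitation
codebook = rng.standard_normal((128, L)) # zero-mean Gaussian codebook

def best_gain(target, x):
    # Closed-form optimal gain: beta = sum(target*x) / sum(x^2)
    return np.dot(target, x) / np.dot(x, x)

# Stage 1: exhaustive search over the self-excitation lag gamma2,
# with the gain beta2 computed in closed form for each candidate lag.
err2, gamma2, beta2 = np.inf, None, None
for lag in range(L, 161):
    start = len(past_e) - lag
    x2 = np.convolve(g, past_e[start:start + L])[:L]
    b = best_gain(y, x2)
    err = np.sum((y - b * x2) ** 2)
    if err < err2:
        err2, gamma2, beta2 = err, lag, b

start = len(past_e) - gamma2
x2 = np.convolve(g, past_e[start:start + L])[:L]
y1 = y - beta2 * x2                      # residual target for stage 2

# Stage 2: exhaustive Gaussian-codebook search, beta1 in closed form.
err1, gamma1, beta1 = np.inf, None, None
for idx, f in enumerate(codebook):
    x1 = np.convolve(g, f)[:L]
    b = best_gain(y1, x1)
    err = np.sum((y1 - b * x1) ** 2)
    if err < err1:
        err1, gamma1, beta1 = err, idx, b
```

With the optimal gain, each stage's error can never exceed the energy of its target, so the search monotonically improves on a zero-gain excitation.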

    3 Gain Normalization

Figures 4(a) and 4(b) show |β1| and |β2| as functions of the frame index. In this example, the system of Figure 1 was used to compute 16 low-time cepstrum values with a vocal tract analysis frame spacing of 20 msec (160 samples). The cepstrum values were vector quantized using a codebook of 256 entries (8 bits/cepstrum).[6] The system of Figure 3 was used to obtain the excitation signal using a 40-dimensional Gaussian codebook with 128 entries (γ1 represented by 7 bits) and with an excitation signal memory (γ2) ranging from 32 to 160 samples. The excitation frame length and spacing were both 5 msec (40 samples).

The behaviors of |β1| and |β2| with time are distinctly different, but the difference is easily understood. First note from (1) that the impulse responses h[n] are all normalized automatically because ĥ[0] = 0. Then recall that β2 is the constant multiplier of a portion of the previously computed excitation that contributes to the excitation in the current frame. Notice in Figure 4(b) that |β2| remains fairly constant near unity, except for large spikes and abrupt dips toward zero. This is to be expected, since in steady-state regions the amplitude in an excitation analysis frame should be about the same as the amplitude in previous frames in that steady-state region. However, in a transition frame from voiced to unvoiced, past excitation amplitudes will be much larger than required, and therefore |β2| will have to be small to compensate. Likewise, in transitions from unvoiced to voiced, the immediate past excitation will be small, while a larger excitation will be required in the current frame. Therefore |β2| must be large to compensate.

In contrast, |β1| tends to track the energy envelope of the speech signal and is somewhat better behaved. Clearly, the amplitude of the residual signal y1[n] will be proportional to the amplitude of the original speech signal. Therefore, since the codebook sequences all have the same energy, |β1| will track the amplitude of y1[n] and therefore also the amplitude of the input speech signal.

Figure 4 shows that |β1| and |β2| are not highly correlated, and therefore it would seem that there is little to be gained by jointly quantizing them. However, recall that |β2| is generally close to unity, while |β1| tends to follow the amplitude of the speech signal and the excitation signal. This suggests that if |β2| is normalized by a function of the previous excitation energy, then the correlation with |β1| can be greatly increased. Indeed, Figure 5(a) shows the parameter α|β2|, where

\[ \alpha = \left[ \left( \sum_{n=0}^{L-1} e^2[n-\gamma_2] \right) \left( \sum_{n=0}^{L-1} e^2[n-L] \right) \right]^{1/2} \tag{8} \]

with L representing the excitation frame length. That is, the gain normalizing factor α is the geometric mean of the energy of the excitation segment beginning at γ2 and the energy of the just-previous excitation frame. This averaging gives a smoothly varying normalizing factor which, as can be seen from Figure 5(a), converts |β2| into a parameter that varies with time in much the same way that |β1| varies.
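The normalizing factor α, and its inversion at a receiver that shares the same past excitation, can be sketched as follows (NumPy; the excitation history, lag, and gain value are illustrative stand-ins, and quantization of the transmitted parameter is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
L = 40                                   # excitation frame length
past_e = rng.standard_normal(200)        # excitation history
gamma2, beta2 = 64, 0.95                 # hypothetical lag and gain

def alpha(past_e, gamma2, L):
    # Geometric mean of the energy of the excitation segment beginning
    # gamma2 samples back and the energy of the just-previous frame.
    start = len(past_e) - gamma2
    e_seg = past_e[start:start + L]      # segment at lag gamma2
    e_prev = past_e[-L:]                 # just-previous excitation frame
    return np.sqrt(np.sum(e_seg ** 2) * np.sum(e_prev ** 2))

a = alpha(past_e, gamma2, L)
transmitted = a * abs(beta2)             # the normalized gain alpha*|beta2|

# Receiver side: alpha is recomputed from the shared past excitation,
# so |beta2| is recovered exactly (before any quantization error).
beta2_rx = transmitted / alpha(past_e, gamma2, L)
```

Because both ends compute α from the same decoded past excitation, no extra side information is needed to undo the normalization.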

Figure 6 shows the correlation between |β1| and α|β2| more clearly. Indeed, Figure 6 implies that |β1| is proportional to α|β2| to within a constant maximum percentage error. The straight line in Figure 6 is a least-squares fit to the data, which include 1536 frames from four different utterances by four different speakers. This linear fit to the log-log data is given by (9), which serves as an approximate relationship between the codebook gain and the normalized self-excitation gain α|β2|.

4 Quantization for 4200 bps

                 Bits/Frame            Samples/Frame       Bit Rate
    cepstrum  β1  β2  γ1  γ2     excitation  cepstrum     (bits/sec)
        8      4   4   7   7         40         160          4800
        8      1   4   7   7         40         160          4200

Table 1 Bit allocation for homomorphic vocoders.

The 4800 bps homomorphic vocoder that was previously reported used the bit allocation scheme given in the first row of Table 1. In the 4800 bps vocoder, each of the two gain parameters was coded using a 3-bit APCM coder[8] (plus a sign bit) to code |β1| and |β2|. The results of the previous section suggest that the total bit allocation can be reduced by jointly coding |β1| and α|β2|. Clearly, many schemes can be found to take advantage of the correlation illustrated in Figure 6. One approach is simply to code α|β2| using 3-bit APCM. (Figure 5(b) shows an example of this 3-bit quantization.) This information, together with one bit each for the signs of β1 and β2, completes the representation, for a total of five bits instead of eight. With an excitation frame rate of 200 frames/sec, this results in a reduction of 600 bps.

At the receiver, |β2| is derived from the quantized version of α|β2| by dividing by α, which is derivable using (8) from the past excitation. Then |β1| is obtained from the derived |β2| through (9). As an illustration, Figures 7(a) and 7(b) show the decoded |β1| and |β2|, respectively, for the corresponding parameters in Figures 4(a) and 4(b).

The performance of the 4200 bps homomorphic vocoder is virtually identical to that of the 4800 bps version. This is confirmed by careful listening tests and by the fact that, over a range of speakers and utterances, the signal-to-noise ratio decreases by less than 0.4 dB in going from 4800 to 4200 bps.

5 Conclusions

We have discussed a scheme for coding the codebook gain and the self-excitation gain in a vector-excitation homomorphic vocoder. The proposed method permits a significant reduction of bit rate for the homomorphic vocoder with virtually no loss of quality. Furthermore, the method is applicable to any vocoder using the analysis-by-synthesis method of excitation analysis. Future work will consider other variations on the normalization scheme as well as other methods of jointly quantizing the two gain parameters.
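As a numerical check on the bit allocations of Table 1, the frame rates follow from the 8 kHz sampling rate (40-sample excitation frames give 200 frames/sec; 160-sample cepstrum frames give 50 frames/sec):

```python
# Bit-rate bookkeeping for the two vocoders of Table 1.
EXC_RATE = 8000 // 40      # 40-sample excitation frames -> 200 frames/sec
CEP_RATE = 8000 // 160     # 160-sample cepstrum frames  ->  50 frames/sec

# 4800 bps: 8-bit cepstrum VQ; 4 bits per gain (3-bit APCM + sign);
# 7 bits each for the indices gamma1 and gamma2.
rate_4800 = CEP_RATE * 8 + EXC_RATE * (4 + 4 + 7 + 7)

# 4200 bps: gains jointly coded as 3-bit APCM for alpha*|beta2| plus
# one sign bit each for beta1 and beta2 (5 bits total).
rate_4200 = CEP_RATE * 8 + EXC_RATE * (5 + 7 + 7)

print(rate_4800, rate_4200, rate_4800 - rate_4200)   # 4800 4200 600
```

The difference of 600 bps comes entirely from the three excitation-gain bits saved per 5 msec frame.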


    References


[1] A. V. Oppenheim, "Speech analysis-synthesis system based on homomorphic filtering," J. Acoust. Soc. Am., vol. 45, pp. 458-465, Feb. 1969.

[2] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. on Audio and Electroacoustics, pp. 118-123, June 1968.

[3] B. Atal and J. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 614-617, 1982.

[4] M. R. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high-quality speech at very low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 937-940, 1985.

[5] R. Rose and T. P. Barnwell, III, "Quality comparison of low complexity 4800 bps self excited and code excited vocoders," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 1637-1640, 1987.

[6] J. H. Chung and R. W. Schafer, "A 4.8 kbps homomorphic vocoder using analysis-by-synthesis excitation analysis," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 144-147, 1989.

[7] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 3, pp. 247-254, June 1979.

[8] N. S. Jayant, "Adaptive quantization with a one word memory," Bell System Tech. J., pp. 1119-1144, September 1973.

    Figure 1 Homomorphic filtering for estimating the vocal tract impulse response.


    Figure 2 Synthesizer for homomorphic vocoder.


Figure 3 Analysis-by-synthesis method for obtaining the excitation sequence e[n].


Figure 4 (a) Codebook gain |β1|. (b) Self-excitation gain |β2|. (Horizontal axis: excitation frame index.)

Figure 5 (a) Unquantized normalized self-excitation gain α|β2|. (b) Quantized normalized self-excitation gain. (Horizontal axis: excitation frame index.)

Figure 6 Illustration of correlation between |β1| and α|β2|.

Figure 7 (a) Quantized codebook gain. (b) Quantized self-excitation gain.