
PERFORMANCE EVALUATION OF ANALYSIS-BY-SYNTHESIS HOMOMORPHIC VOCODERS

Jae H. Chung
AT&T Bell Laboratories
200 Park Plaza, Naperville, Illinois 60566

and

Ronald W. Schafer
Georgia Institute of Technology, School of Electrical Engineering
Atlanta, Georgia 30332-0250

ABSTRACT

Previous research [1] [2] has shown that the homomorphic filtering procedure combined with analysis-by-synthesis excitation coding is a promising alternative to LPC for low bit rate vocoding. In particular, static vector excited and dynamic vector excited homomorphic vocoders have been designed. In this paper, the performance of these recently developed homomorphic vocoders is evaluated through formal subjective listening tests, using a variation of the Paired Acceptability Rating Method. The subjective test results show that the new vocoder framework using analysis-by-synthesis excitation analysis is capable of producing good speech quality at 4.8 Kbps or lower.

1. INTRODUCTION

In the homomorphic vocoder, the time-varying vocal tract information is represented by the low-time cepstrum, which is obtained using the homomorphic filtering procedure illustrated in Figure 1. For each frame of the pre-emphasized speech signal s[n], the corresponding cepstrum c[n] is computed. The low-time cepstrum $\hat{h}[n]$ representing the vocal tract information is then extracted from the cepstrum c[n]. Finally, the normalized impulse response h[n] is derived from the low-time cepstrum $\hat{h}[n]$.

Recently a new vocoder framework has been proposed based on the homomorphic filtering procedure combined with the analysis-by-synthesis excitation coding procedure [1] [2]. In this vocoder framework, the excitation signal e[n] is determined blockwise using the analysis-by-synthesis excitation coding algorithm, which was originally proposed by Atal et al. [3] [4] and applied to the LPC vocal tract model to determine its excitation signal. The analysis-by-synthesis method for obtaining the excitation sequence e[n] is illustrated in Figure 2.

In this paper, the performance of recently developed homomorphic vocoders is evaluated through formal subjective listening tests, using a variation of the Paired Acceptability Rating Method. In particular, the subjective tests include static vector excited [1] and dynamic vector excited [2] homomorphic vocoders.

This paper is organized as follows. In Section 2, the basic structures of two analysis-by-synthesis homomorphic vocoders, namely the static vector excited and dynamic vector excited homomorphic vocoders, are briefly discussed. In Section 3, the method used to measure the performance of the homomorphic vocoders is described. In Section 4, the subjective test results of the homomorphic vocoders are reported.
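The homomorphic filtering procedure of Figure 1 can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: the cepstrum is taken to be the real cepstrum (inverse DFT of the log magnitude spectrum), the vocal tract response is taken to be minimum phase, "normalized" is read as dropping the gain term c[0], and a naive DFT stands in for an FFT. All function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a short illustrative frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame):
    """c[n]: inverse DFT of the log magnitude spectrum of the frame."""
    log_mag = [math.log(max(abs(v), 1e-12)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

def low_time_cepstrum(c, n_low=11):
    """Keep c[1..n_low] and double it (minimum-phase assumption).
    The gain term c[0] is set to zero so that h[0] = 1 (assumed normalization)."""
    return [0.0] + [2.0 * c[n] for n in range(1, n_low + 1)]

def impulse_response(h_hat, length):
    """Minimum-phase recursion: h[n] = sum_{k=1..n} (k/n) * h_hat[k] * h[n-k]."""
    h = [math.exp(h_hat[0])]
    for n in range(1, length):
        acc = 0.0
        for k in range(1, min(n, len(h_hat) - 1) + 1):
            acc += (k / n) * h_hat[k] * h[n - k]
        h.append(acc)
    return h

# Example frame: a decaying exponential, whose minimum-phase response is 0.5**n.
frame = [0.5 ** n for n in range(64)]
h = impulse_response(low_time_cepstrum(real_cepstrum(frame)), 8)
```

The recursion avoids a second pair of transforms and needs only the retained low-time values, which is convenient when just 11 cepstrum values are available at the receiver.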

2. ANALYSIS-BY-SYNTHESIS HOMOMORPHIC VOCODERS

2.1. Static Vector Excited Homomorphic Vocoder

In the static vector excited homomorphic vocoder (SVHV), the excitation signal e[n] is composed of two parts: $\beta_0 e[n - \gamma_0]$, which represents a short segment of the past (previously computed) excitation beginning $\gamma_0$ samples before the present excitation frame, and $\beta_1 f_{\gamma_1}[n]$, where $f_{\gamma_1}[n]$ is a zero-mean Gaussian codebook sequence corresponding to index $\gamma_1$ in the codebook, i.e.,

$$e[n] = \beta_0\, e[n - \gamma_0] + \beta_1\, f_{\gamma_1}[n]. \qquad (1)$$

At the synthesizer, the excitation sequence e[n] is convolved with the vocal tract impulse response h[n] to produce the synthetic speech output $\hat{s}[n]$.
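A minimal sketch of how the two terms of Eqn. (1) might be selected by analysis-by-synthesis is given below. It is a simplification under stated assumptions, not the procedure of [1]: the two terms are searched sequentially rather than jointly, the perceptual weighting filter w[n] and the ringing of previous blocks are omitted, and lags shorter than the block length are not considered. All names are hypothetical.

```python
import random

def convolve_truncated(x, h):
    """First len(x) samples of the convolution of x with h."""
    return [sum(x[m] * h[k - m] for m in range(k + 1) if k - m < len(h))
            for k in range(len(x))]

def best_candidate(target, candidates, h):
    """Pick the candidate and gain minimizing ||target - gain * (cand * h)||^2."""
    best = None
    target_energy = sum(t * t for t in target)
    for idx, cand in enumerate(candidates):
        y = convolve_truncated(cand, h)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        err = target_energy - corr * corr / energy  # error at the optimal gain
        if best is None or err < best[0]:
            best = (err, idx, corr / energy, y)
    _, idx, gain, y = best
    residual = [t - gain * v for t, v in zip(target, y)]
    return idx, gain, residual

def svhv_excitation(target, h, past, codebook, block=40):
    """Return e[n] for one block and the parameters (beta0, gamma0, beta1, gamma1)."""
    lags = range(block, len(past) + 1)  # assumption: lag >= block length
    adaptive = [past[len(past) - lag: len(past) - lag + block] for lag in lags]
    i0, beta0, residual = best_candidate(target, adaptive, h)
    i1, beta1, _ = best_candidate(residual, codebook, h)
    e = [beta0 * a + beta1 * f for a, f in zip(adaptive[i0], codebook[i1])]
    return e, (beta0, lags[i0], beta1, i1)

# Example with made-up data: a 160-sample excitation history and a 16-entry codebook.
random.seed(0)
h = [1.0, 0.5, 0.25]
past = [random.gauss(0.0, 1.0) for _ in range(160)]
codebook = [[random.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
```

Searching the past-excitation term first and the codebook term on the residual keeps the cost linear in the number of candidates; a joint search would generally give a lower error at a higher cost.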

Table 1 shows the bit allocations used for a 4.8 Kbps SVHV. For speech sampled at 8 kHz, frame intervals of 20 msec (160 samples) and 5 msec (40 samples) are used for the vocal tract analysis and excitation analysis, respectively. A sequence of 11 low-time cepstrum values is vector quantized using a full-search codebook of size 256. The Euclidean cepstral distance measure, defined as

$$d(c, \hat{c}) = \sum_{i=1}^{11} \left| c[i] - \hat{c}[i] \right|^2, \qquad (2)$$

is used as the distortion measure, where c and $\hat{c}$ are the vectors of original and reproduced real cepstrum values. Each of the two excitation gain parameters $\beta_0$ and $\beta_1$ is coded using a 4-bit APCM coder. Henceforth this system will be denoted as SVHV-4800.
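The full-search vector quantization with the distance of Eqn. (2) can be sketched as follows. The function names and the tiny three-entry codebook are illustrative assumptions; the system described above uses 256 entries of 11 values each.

```python
def cepstral_distance(c, c_hat):
    """Sum of squared differences between two cepstrum vectors, as in Eqn. (2)."""
    return sum((a - b) ** 2 for a, b in zip(c, c_hat))

def full_search(c, codebook):
    """Index of the codeword closest to c; every codeword is examined."""
    return min(range(len(codebook)), key=lambda i: cepstral_distance(c, codebook[i]))

# Toy codebook of three 11-dimensional codewords (made-up values).
codebook = [[0.0] * 11, [1.0] * 11, [2.0] * 11]
index = full_search([0.9] * 11, codebook)
```

With a 256-entry codebook the transmitted index fits in 8 bits per vocal tract frame.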

Table 1: Bit allocations used for SVHV-4800.

2.2 Dynamic Vector Excited Homomorphic Vocoder

0-7803-0532-9/92 $3.00 © 1992 IEEE

In the SVHV, the excitation sequence has the fixed form shown in Eqn. (1), independent of the voicing state of the speech segment. Figure 3 shows three distinctly different types of speech segments, namely, voiced, unvoiced, and mixed. In the case of a voiced segment, the speech waveform is dominated by its quasi-periodic nature. For an unvoiced segment, the speech waveform appears to vary randomly. Finally, a mixed speech segment has both a quasi-periodic and a random nature. In the dynamic vector excited homomorphic vocoder (DVHV), for more effective and efficient excitation modeling, different excitation modeling strategies are applied to the different classes of speech sounds, i.e., voiced, unvoiced, and mixed. For voiced segments, the excitation signal e[n] is composed of two sequences selected from a time-varying queue of the past excitation history, i.e.,

$$e[n] = \beta_0\, e[n - \gamma_0] + \beta_1\, e[n - \gamma_1]. \qquad (3)$$

This emphasizes the pitch dependent periodic nature. In the unvoiced case, the excitation signal e[n] is modeled by a Gaussian codebook sequence, i.e.,

$$e[n] = \beta_1\, f_{\gamma_1}[n], \qquad (4)$$

emphasizing the random nature. Finally, in the case of segments classified as mixed excitation, the excitation e[n] is modeled as the sum of a Gaussian codebook sequence and a sequence selected from the fixed interval of the past excitation history, i.e.,

$$e[n] = \beta_0\, e[n - \gamma_0] + \beta_1\, f_{\gamma_1}[n]. \qquad (5)$$

Since different voicing states result in different excitation modeling, the classification of a given speech block into the correct voicing state is important. The cepstrum has been successfully applied as a tool for voicing decision as well as pitch detection in the speech processing literature [5] [6]. Figure 4 shows the corresponding cepstrums of the speech segments shown in Figure 3. The periodic nature of the voiced speech segment results in strong cepstral peaks, whereas the random nature of the unvoiced speech segment does not cause any strong cepstral peaks. In the homomorphic vocoder, the vocal tract information is represented by the low-time cepstrum, and therefore the cepstrum is available without any additional computation. Consequently, a voicing classification method primarily based on the cepstrum is adopted in the DVHV. Details of the voicing classification algorithm can be found in [2].
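The voicing classification algorithm itself is given in [2] and is not reproduced here. Purely as an illustration of how a cepstral peak can drive a three-way decision, the following sketch compares the largest cepstral value in a plausible pitch range against two thresholds. The lag range (20 to 160 samples, roughly 400 Hz down to 50 Hz at 8 kHz sampling) and both thresholds are hypothetical values chosen for the example.

```python
def classify_voicing(cepstrum, min_lag=20, max_lag=160, low=0.1, high=0.3):
    """Three-way voicing decision from the largest cepstral peak in the pitch range.

    All four parameters are hypothetical; they are not taken from [2].
    """
    peak = max(cepstrum[min_lag:max_lag + 1])
    if peak >= high:
        return "voiced"      # strong peak: quasi-periodic segment
    if peak <= low:
        return "unvoiced"    # no pronounced peak: noise-like segment
    return "mixed"           # intermediate peak: both components present

# Made-up cepstrum with a strong peak at a lag of 80 samples (100 Hz at 8 kHz).
cepstrum = [0.0] * 200
cepstrum[80] = 0.5
state = classify_voicing(cepstrum)
```

Because the cepstrum is already computed for the vocal tract analysis, a decision of this kind adds only a maximum search over the pitch range.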

Two different systems have been studied. Table 2 shows the bit allocations used for voiced, unvoiced, and mixed segments in the first system. The maximum bit rate is 5600 bps when the voicing state is mixed, and the minimum bit rate is 3200 bps when the voicing state is unvoiced. The average bit rate of this system is about 4400 bps. A sequence of the cepstrum values is vector quantized. The vector codebook is composed of three sub-codebooks, and the sub-codebooks are arranged sequentially so that the search for the best codeword can be restricted to a particular sub-codebook depending on the voicing state. For example, suppose the codebook is arranged such that the first region is filled with the sub-codebook for voiced, the second region with the sub-codebook for unvoiced, and the third region with the sub-codebook for mixed. In this case, when the voicing state is classified as unvoiced, the search for the codeword can be restricted to the second region of the codebook, and the index of the best vector is transmitted to the receiver. Because the structure of the codebook is known, the transmitted index identifies both the chosen codeword and the voicing state. Hence, no additional information about the voicing state is necessary at the receiver. This system will be denoted as DVHV-4400.
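The sub-codebook arrangement described above can be sketched as follows. The region boundaries and the 8-entry, one-dimensional codebook are made-up values for illustration; the sizes of the actual sub-codebooks are not given in the text.

```python
# Hypothetical layout: [start, end) index range of each sub-codebook.
REGIONS = {"voiced": (0, 4), "unvoiced": (4, 6), "mixed": (6, 8)}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def restricted_search(c, codebook, voicing):
    """Encoder: search only the region that belongs to the voicing state."""
    start, end = REGIONS[voicing]
    return min(range(start, end), key=lambda i: squared_distance(c, codebook[i]))

def voicing_from_index(index):
    """Decoder: recover the voicing state from the received index alone."""
    for state, (start, end) in REGIONS.items():
        if start <= index < end:
            return state
    raise ValueError("index outside the codebook")

codebook = [[float(i)] for i in range(8)]  # toy one-dimensional codewords
index = restricted_search([0.2], codebook, "unvoiced")
state = voicing_from_index(index)
```

Note that the restricted search returns the best codeword within the region even when a closer codeword exists elsewhere in the codebook, which is what lets the index carry the voicing state at no extra cost in bits.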

           Bits/Frame                        Samples/Frame              Bit Rate
           vocal tract | gains | indices  |  excitation | vocal tract | (bits/sec)
unvoiced   12          | 5     | 8        |  40         | 160         | 3200
mixed      12          | 5 5   | 7 8      |  40         | 160         | 5600

Table 2: Bit allocations used for DVHV-4400.

The bit allocations used for the second system are shown in Table 3. The maximum bit rate is 4800 bps when the voicing state is either voiced or mixed. The minimum bit rate is 2667 bps when the voicing state is unvoiced. The average bit rate of this system is about 3800 bps. This system is denoted as DVHV-3800.

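           Bits/Frame                        Samples/Frame              Bit Rate
           vocal tract | gains | indices  |  excitation | vocal tract | (bits/sec)
mixed      12          | 5 5   | 7 7      |  45         | 180         | 4800

Table 3: Bit allocations used for DVHV-3800.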
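The bit rates quoted for the two systems follow from the per-frame allocations. The short check below assumes that the vocal tract index is sent once per vocal tract frame, that the gains and indices are sent once per excitation block, and that the sampling rate is 8 kHz; the function name is hypothetical.

```python
def bit_rate(vocal_tract_bits, excitation_bits, excitation_samples,
             vocal_tract_samples, fs=8000):
    """Bits per second for one voicing state."""
    return fs * (vocal_tract_bits / vocal_tract_samples
                 + excitation_bits / excitation_samples)

# Excitation bits per block = gain bits + index bits from the table rows.
dvhv4400_unvoiced = bit_rate(12, 5 + 8, 40, 160)
dvhv4400_mixed = bit_rate(12, 5 + 5 + 7 + 8, 40, 160)
dvhv3800_mixed = bit_rate(12, 5 + 5 + 7 + 7, 45, 180)
```

Up to floating-point rounding these evaluate to the 3200, 5600, and 4800 bps figures quoted for the corresponding voicing states.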

3. SUBJECTIVE EVALUATION

For the subjective evaluation of the homomorphic vocoders in this study, a variation of the Paired Acceptability Rating Method (PARM) was used as a subjective quality measure [7]. The PARM measure is an isometric speech quality measure in which the listeners rate the perceived speech quality of a set of speech communication systems on a scale of 0 to 100. In our case, a speech communication system is defined as a completely specified method of coding speech. The PARM test involves the comparison of all possible pairs of systems in a PARM module. The synthetic speech from a pair of systems is presented to the listeners, and the listeners are asked to rate each sentence on a scale of 0 to 100. The high and low anchors have fixed scores of 80 and 20, respectively, and the anchors are presented to the listeners at the beginning of the test and periodically during it. Providing the anchors allows the listeners' responses to be normalized, i.e., each of the systems can be evaluated relative to the anchors.

In our subjective tests, an undistorted original signal was used as the high anchor. The low anchor was a heavily distorted synthetic speech signal produced by a homomorphic vocoder using a Gaussian random ensemble as the excitation. In the tests, all possible pairs were presented to the listeners, and the systems in each pair were presented in random order. In our study, 3 utterances (spoken by 2 males and 1 female) were used for the tests. Sixteen untrained listeners participated in the subjective tests.

For each system, the subjective tests were analyzed by two statistics, namely, the mean score and the standard deviation of the mean score. For a given system k, the listener preference score is represented as $R_k(s, l, n)$, where

s : the sentence generated by the system k
l : the listener who produced the score
n : the iteration of presentation of sentences to listener l.

With this notation, the mean score for coding system k, $A(k)$, is then

$$A(k) = \frac{1}{N_s N_l N_n} \sum_{s=1}^{N_s} \sum_{l=1}^{N_l} \sum_{n=1}^{N_n} R_k(s, l, n)$$

where

N_s : the number of sentences
N_l : the number of listeners
N_n : the number of iterations for each sentence.

The standard deviation for coding system k, $\sigma(k)$, is computed from the same set of scores.

4. SUBJECTIVE TEST RESULTS

The subjective tests included five fully quantized vocoders besides the high and low anchors. Three of them were homomorphic vocoders, i.e., SVHV-4800, DVHV-4400, and DVHV-3800. In addition, two CELP coders were included for comparison. One was a software implementation of the proposed Federal Standard 1016 4800 bps CELP coder, which was obtained from the Department of Defense [8]. This coder is referred to as DoD-4800.

The second CELP system was implemented by the authors. It was a version of our SVHV-4800 system modified to employ an LPC-based vocal tract model. In this system, the vocal tract parameters and excitation gain parameters were quantized based on the proposed DoD standard specifications. However, many algorithmic techniques used in the proposed DoD standard were not implemented. The interpolation of the vocal tract impulse response parameters from one frame to the next is one example. The integer and non-integer adaptive codebook search technique used in determining the excitation signal is another. This coder is referred to as CELP-5000, and the bit allocations used are shown in Table 4.

Bits/Frame                                                 Samples/Frame              Bit Rate
vocal tract | $\beta_0$ $\beta_1$ | $\gamma_0$ $\gamma_1$ |  excitation | vocal tract | (bits/sec)
34          | 5 5                 | 7 7                   |  52         | 208         | 5000

Table 4: Bit allocations used for the CELP-5000.

The test results are summarized in Table 5. It is clear that the DoD-4800, with a PARM score of 66.0, performed significantly better than all the other test systems. The scores for the other systems ranged from 55.5 for the CELP-5000 down to 51.4 for the DVHV-3800. The differences between the average PARM scores for CELP-5000 and the homomorphic systems were not statistically significant. Also, the differences among the homomorphic vocoders were not statistically significant.

Comparing the homomorphic vocoders to CELP-5000 suggests that the homomorphic vocal tract model, when used with an analysis-by-synthesis excitation model, can produce coded speech with essentially the same quality as LPC-based CELP coders. Furthermore, dynamic excitation modeling appears to be effective in lowering the average bit rate while maintaining speech quality. Finally, the noticeable difference between the average PARM score for the DoD coded CELP system and the score for our version of the proposed DoD standard can only be attributed to the omission in our system of advanced algorithmic techniques that have been developed in the last few years [9]. Therefore, we speculate that the homomorphic systems would also achieve significantly higher PARM scores if these features were implemented.

System        | Average Score | Standard Deviation
High Anchor   | 74.8          | 6.9
DoD-4800      | 66.0          | 7.5
CELP-5000     | 55.5          | 6.0
DVHV-4400     | 54.5          | 6.2
SVHV-4800     | 53.7          | 5.6
DVHV-3800     | 51.4          | 6.8
Low Anchor    | 25.0          | 4.6

Table 5: Subjective test results of homomorphic vocoders.
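The mean score A(k) defined in Section 3 is a plain average over sentences, listeners, and iterations, as the following sketch shows. The standard deviation computed here is the ordinary population standard deviation over all scores, which is an assumption; the exact expression used for the reported values may differ. The function name and the example scores are made up.

```python
import math

def parm_statistics(scores):
    """scores[s][l][n] holds the score for sentence s, listener l, iteration n.

    Returns (mean, standard deviation). The standard deviation is the population
    value over all scores, an assumption made for this illustration.
    """
    flat = [r for sentence in scores for listener in sentence for r in listener]
    mean = sum(flat) / len(flat)
    variance = sum((r - mean) ** 2 for r in flat) / len(flat)
    return mean, math.sqrt(variance)

# Made-up scores: 1 sentence, 2 listeners, 2 iterations each.
mean, std = parm_statistics([[[50, 60], [70, 80]]])
```

In the tests described above the nested list would have 3 sentences and 16 listeners for each system.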

5. CONCLUSIONS

In this paper, the performance of analysis-by-synthesis homomorphic vocoders was evaluated through formal subjective listening tests. A variation of the Paired Acceptability Rating Method was used as a subjective quality measure. The subjective test results showed that the new vocoder framework using analysis-by-synthesis excitation analysis is capable of producing good speech quality at 4.8 Kbps or lower.

REFERENCES

[1] J. H. Chung and R. W. Schafer, "A 4.8 Kbps homomorphic vocoder using analysis-by-synthesis excitation analysis," Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 144-148, 1989.

[2] J. H. Chung and R. W. Schafer, "Excitation modeling in a homomorphic vocoder," Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 25-28, 1990.

[3] B. S. Atal and J. R. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 616-617, 1982.

[4] M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 937-940, 1985.

[5] A. Noll, "Cepstrum pitch determination," Journal of Acoust. Soc. Amer., pp. 293-309, vol. 41, Feb. 1967.

[6] R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," Journal of Acoust. Soc. Amer., pp. 634-648, vol. 47, Feb. 1970.

[7] W. D. Voiers, "Methods of predicting user acceptance of voice communications systems," Final Report 100-74-C-0056, DCA, 1976.

[8] Department of Defense, The DoD 4.8 kbps standard (Proposed Federal Standard 1016), Third Draft, Aug. 1990.

[9] B. S. Atal, V. Cuperman, and A. Gersho (editors), Advances in Speech Coding, Kluwer Academic Publishers, December 1990.

Fig. 1: Homomorphic filtering for estimating vocal tract impulse response.

Fig. 2: Analysis-by-synthesis for obtaining excitation sequence (w[n]: perceptual weighting filter).

Fig. 3: A speech segment of (a) voiced, (b) unvoiced, and (c) mixed.

Fig. 4: Cepstrums of speech segment of (a) voiced, (b) unvoiced, and (c) mixed.
