Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with quantized F0 symbols

Available online at www.sciencedirect.com

www.elsevier.com/locate/specom

Speech Communication 54 (2012) 384–392

Very low bit-rate F0 coding for phonetic vocoders usingMSD-HMM with quantized F0 symbols

Takashi Nose ⇑, Takao Kobayashi

Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama 226-8502, Japan

Received 21 February 2011; received in revised form 4 October 2011; accepted 6 October 2011Available online 14 October 2011

Abstract

This paper presents a technique of very low bit-rate F0 coding for phonetic vocoders based on a hidden Markov model (HMM) usingphone-level quantized F0 symbols. In the proposed technique, an input F0 sequence is converted into an F0 symbol sequence at thephone level using scalar quantization. The quantized F0 symbols represent the rough shape of the original F0 contour and are usedas the prosodic context for the HMM in the decoding process. To model the F0 that has voiced and unvoiced regions, we use multi-spaceprobability distribution HMM (MSD-HMM). Synthetic speech is generated from the context-dependent labels and pre-trained MSD-HMMs by using the HMM-based parameter generation algorithm. By taking into account the preceding and succeeding contexts as wellas the current one in the modeling and synthesis, we can generate a smooth F0 trajectory similar to that of the original with only a smallnumber of quantization bits. The experimental results reveal that the proposed F0 coding outperforms the conventional segment-basedF0 coding technique using MSD-VQ. We also demonstrate that the decoded speech of the proposed vocoder has acceptable quality evenwhen the F0 bit-rate is less than 50 bps.� 2011 Elsevier B.V. All rights reserved.

Keywords: Phonetic vocoder; HMM-based speech synthesis; Very low bit-rate speech coding; MSD-HMM; MSD-VQ

1. Introduction

Segment-based coding is one of the most popularapproaches to very low bit-rate speech coding at a rateon the order of 100 bits/sec. In the segment-based coding,several frames are regarded as an acoustic segment andencoded into a discrete symbol using a codebook trainedin advance. One of the typical segment-based coders is pho-netic vocoder where a phone is used as an acoustic unit forencoding and decoding processes.

The basic idea of the phonetic vocoder was introducedin the 1950s by Dudley (1958), where the number of pho-nemes was very limited and the quality of the decodedspeech was not much satisfactory. With improvements tocomputational performance and the development of pho-

0167-6393/$ - see front matter � 2011 Elsevier B.V. All rights reserved.

doi:10.1016/j.specom.2011.10.002

⇑ Corresponding author. Tel.: +81 45 924 5030; fax: +81 45 924 5055.E-mail addresses: [email protected] (T. Nose), takao.

[email protected] (T. Kobayashi).

neme recognition and speech synthesis, a number of tech-niques were studied in the 1980s, e.g., Schwartz et al.(1980), Soong (1989), Picone and Doddington (1989). Inthese techniques, the encoding was based on phoneme rec-ognition and the decoding was based on concatenative syn-thesis of phone units. In the late 1990s, a parametergeneration algorithm for HMM-based speech synthesis(Tokuda et al., 1995b) was employed into the decoder part(Tokuda et al., 1998) in place of the unit selection. In theHMM-based phonetic vocoder, a spectral feature sequenceis generated from the HMMs that are used as acousticmodels in the phoneme recognition.

The HMM-based phonetic vocoder is a promisingapproach to very low bit-rate spectral coding and cangenerate natural sounding speech using relatively a smalleramount of speech data than that in unit selection.However, most of the related studies have mainly focusedon spectral coding, and prosodic coding, especially thefundamental frequency (F0), has not been well discussed.

http://dx.doi.org/10.1016/j.specom.2011.10.002

mailto:[email protected]



http://dx.doi.org/10.1016/j.specom.2011.10.002

T. Nose, T. Kobayashi / Speech Communication 54 (2012) 384–392 385

It is obvious that F0 is also an essential factor in speechrepresentation to express linguistic information such asaccent and tone. Moreover, F0 often carries para-linguisticcues such as emotion and speaking style which enhance thespeech communication. One of the difficulties in the F0coding at very low bit-rates is that it is not easy to automat-ically extract such linguistic and para-linguistic informa-tion whereas spectral features can effectively berepresented by the phonetic information obtained withphoneme recognition.

Several techniques have been proposed to overcome theproblem with F0 coding. Picone and Doddington (1987)proposed contour quantization, where the F0 contourwas normalized by a nominal value and was vector-quantized. Lee and Cox (2001) proposed an alternativetechnique using piecewise linear approximation (Scheffers,1988), which is similar to polygon approximation(Katsaggelos et al., 2002) in image coding. However, thedecoded F0 contour was linear within each segment andF0 variations in segment boundaries were not smooth. Inaddition, the above two techniques did not use phoneticinformation through the F0 coding process, which couldenhance the coding efficiency. To address these problems,Hoshiya et al. (2003) introduced a statistical approach intothe segment-based F0 coding. They proposed the vectorquantization based on multi-space probability distribution(MSD-VQ). In this technique, F0 values are modeled withthe MSD where the observation space of F0 features is rep-resented by a union of voiced and unvoiced spaces.Although this approach is feasible to statistically treat F0values, codebooks are separately trained for respectivephonemes and codewords depend only on current pho-nemes. This means that the context of preceding and suc-ceeding phonemes are not taken into account in theMSD-VQ. In contrast, since the synthesis unit is generallymodeled with a phonetically and prosodically context-dependent HMM in the speech synthesis, there is inconsis-tency between encoding and decoding processes in MSD-VQ.

In this paper, an F0 coding technique for HMM-basedphonetic vocoders is proposed where MSD-HMM(Tokuda et al., 1999) is used for F0 modeling. MSD-HMM is generally used in HMM-based speech synthesisto model F0 that has a continuous value in a voiced frameand has a discrete symbol in an unvoiced frame. In the pro-posed technique, we model the F0 values of each phoneunit using a phonetically and prosodically context-depen-dent MSD-HMM. As described above, it is difficult toautomatically extract prosodic contextual factors such asaccent and tone with high reliability, and inaccurate pro-sodic context could degrade the performance of F0 coding.To overcome this problem, we employ quantized F0 sym-bols (Nose et al., 2010a) which were originally proposedfor unsupervised prosodic labeling in HMM-based speechsynthesis. We obtain the quantized F0 symbol for eachphone by quantizing an average log F0 value of the phone.The F0 symbol sequence represents a rough shape of the

original F0 contour and these symbols are used as the pro-sodic context for a current phone as well as the phoneticcontext, i.e., triphone. This means that we can use not onlythe contextual factors for the current phones but also thosefor the preceding and succeeding phones, which is one ofthe advantages of the proposed F0 coding techniqueagainst the above conventional techniques.

The contributions of the proposed technique to very lowbit-rate F0 coding are summarized as follows. The first isthe use of phonetic and prosodic contexts. In the conven-tional coding technique with the MSD-VQ, the codebookwas trained for each phoneme separately, and the phoneticand F0 information of the preceding and succeeding phonesegments are not taken into account. However, it is wellknown in unit-selection-based speech synthesis that thelack of the information of adjacent phone segments oftencauses discontinuity in the boundaries of the concatenatedsegments, and the quality of the resultant synthetic speechis not always satisfactory. On the other hand, the proposedtechnique takes advantage of the HMM-based speechsynthesis by using the phonetically and prosodically con-text-dependent model, where the contexts capture thesupra-segmental characteristics of spectral and F0 trajecto-ries as well as the segmental ones within each phone. Thesecond is robust parameter estimation using parametertying with decision trees. Generally, in the MSD-VQ, theappropriate codebook sizes for phonemes are differentfrom each other, and we should manually adjust the respec-tive codebooks’ sizes of all phonemes if we wish to reducethe bit-rate as much as possible. On the other hand, in theproposed technique, the number of parameters are auto-matically determined using state-based decision trees withthe minimum description length (MDL) criterion (Shinodaand Watanabe, 2000), where the phonemes and F0 symbolsare taken into account as the contextual factors. Thismeans that the parameters are tied across the phonemesand F0 symbols, and it would make the parameterestimation and F0 modeling more robust.

This paper is organized as follows. In Section 2, wepropose the F0 coding technique based on quantized F0symbols and context-dependent MSD-HMMs. In Section3, we conduct objective and subjective experiments and dis-cuss the results. Finally, Section 4 summarizes our findings.

2. F0 coding based on MSD-HMM

2.1. F0 encoding using phone-level F0 quantization

In the encoder, an extracted F0 contour is convertedinto an F0 symbol sequence at the phone level. Each F0symbol is obtained by roughly quantizing the average logF0 value of each phone. The resulting F0 symbol sequencerepresents the outline of the original F0 contour. In ourprevious studies on unsupervised F0 modeling (Noseet al., 2010a) and voice conversion (Nose et al., 2010b),we found that these F0 symbols could be used as a prosodiccontext for HMM-based speech synthesis.

Quantization codeword

μ-3σ μ μ+3σ Log F0

1

(a) one-bit


1 2 3

(b) two-bit


1 74 63 52

(c) three-bit

Fig. 2. Examples of quantization codewords for F0 quantization.

log

F0

dj nateamienpaueoohonNb ai onnu

1

13 22x355533x235x6674 14 2436

65432

7

Recognized phoneme symbol

QuantizationQuantization boundary

F0 symbol

Fig. 3. Example of F0 encoding using three-bit quantization.

386 T. Nose, T. Kobayashi / Speech Communication 54 (2012) 384–392

For the F0 quantization, we first count the numbers ofvoiced and unvoiced frames of each phone segment. Ifthe number of unvoiced frames is greater than that ofvoiced frames, we label the segment with an unvoiced sym-bol. Otherwise, we regard the segment as a voiced regionand conduct scalar quantization using the statisticalparameters of the input speaker’s F0 distribution. Specifi-cally, we assume that the log F0 values of the input speakerfollow a normal distribution. Before quantizing F0, we cal-culate the global mean l and variance r2 of log F0 valuesfrom the speaker’s training data. We calculate the mean �f p

of the log F0 values for each phone segment p, where thephone boundaries are obtained by phoneme recognitionused for spectral coding. An F0 symbol sp is then obtainedby quantizing �f p into a discrete value out of 2N � 1 sym-bols as follows:

sp ¼ Q½�f p�; sp 2 f1; . . . ; 2N � 1g; ð1Þ

where Q[�] denotes an operation of scalar quantization, andN is the number of quantization bits. As described above,an additional symbol is assigned when the phone segmentis unvoiced. Fig. 1 shows an example of the distributionof log F0 values for a male speaker included in the ATRJapanese speech database. Since we have empirically foundthat there are only a few log F0 values lying outside theinterval [�3r, 3r], we set the 2N � 1 points that equally di-vide the interval [�3r, 3r] as the codewords. Figs. 2 and 3show examples of quantization codewords and phone-levelF0 encoding. The quantized F0 symbols are then transmit-ted to the decoder with phoneme labels and durations.Although entropy coding can be used for transmittingthese symbols with durations in a manner similar to thatin (Hoshiya et al., 2003), we here focus on the F0 codingand do not use such additional techniques to reduce bit-rates to reveal the intrinsic nature of the proposedapproach.

2.2. F0 modeling with MSD-HMM

We model the F0 values of an input speaker based onMSD-HMM (Tokuda et al., 1999) in the decoding process.

0

1.0e4

2.0e4

3.0e4

4.0e4

3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6Log F0

μ-3σ μ+3σμ+2σμ+σμμ-σμ-2σ

Freq

uenc

y

Fig. 1. Example of log F0 histogram for 450 utterances of a male speakerMHT included in ATR Japanese speech database.

MSD-HMM is an extension of HMM that can simulta-neously model the continuous F0 values in voiced framesand discrete symbols for unvoiced frames. In the MSD, asample space X is represented by a union of G differentspaces X1,X2, . . . ,XG. Each space has its occurrence proba-bility wg, i.e., P(Xg) = wg, where

PGg¼1wg ¼ 1. If the dimen-

sion ng of space G is greater than zero, the space has aprobability density function NgðxÞ; x 2 Rng , whereR

Rng NgðxÞdx ¼ 1. If ng = 0, we assume that Xg containsonly one sample point and NgðxÞ � 1. The observationis a random variable o that consists of a continuous vari-able x 2 Rn and a set of space indices X, i.e.,

o ¼ ðx;X Þ; ð2Þ

where all spaces specified by X are n-dimensional. The like-lihood of o is defined by

bðoÞ ¼X

g2SðoÞwgNgðV ðoÞÞ; ð3Þ

where

V ðoÞ ¼ x; SðoÞ ¼ X : ð4Þ


The output probability in each state of MSD-HMM is gi-ven by the MSD. In the F0 modeling, a sample space con-sists of two spaces: a voiced space having continuous valuesand an unvoiced space having only one discrete symbol. Inaddition, we use multi-stream HMM to simultaneouslymodel the spectral and F0 features.

To model F0 values using MSD-HMM, we need pho-netic and prosodic context labels for training data. We cre-ate the quantized F0 context for the training data in thesame manner as is described in Section 2.1, and use it asthe prosodic context. The difference between the encodingand labeling processes is that we use correct phonetic labelsfor the training data in the model training. It is noted thatin both processes we use only the triphone and correspond-ing F0 symbols as the phonetic and prosodic contexts. Themodel training procedure is based on the maximum likeli-hood (ML) estimation of model parameters, and this isalmost the same as that for speech recognition. The detailsof the MSD-HMM and its application to F0 modeling isfound in (Tokuda et al., 1999).

2.3. Parameter tying using context clustering

The number of model parameters becomes very large inthe training of MSD-HMMs because we use the phoneti-cally and prosodically context-dependent models. Toreduce the number of parameters, parameter tying usingdecision-tree-based context clustering is applied to eachstate of the HMMs. We use prosodic questions, i.e., ques-tions for the quantized F0 context as well as for the pho-netic ones in the construction of trees. We can effectivelyreduce the number of parameters for F0 features by usingthese questions in the clustering of model parameters. Weuse the MDL as a stopping criterion of tree construction.Fig. 4 shows an example of a part of a decision tree con-structed using 450 sentences of a male speaker MHTdescribed in Section 3.1. From the figure, it is seen thatnot only the questions related to current F0 symbols butalso those related to phonemes and succeeding F0 symbolsare used in tree construction.

C_f0 == x?

C_f0 <= 3? C_voiced?

C_f0 <= 4? C_f0 <= 2? C_unvoiced_vowel? C_Alveolar_consonant?

R_f0 <= 3? C_f0 <= 1?

No Yes

No

No

No

Yes

YesYes

C_f0 : Current F0 symbolR_f0 : Succeeding F0 symbol

Fig. 4. Example of a constructed tree for the second state of three-stateMSD-HMM.

2.4. F0 decoding from MSD-HMM

In the decoder, a context-dependent label sequence forspeech synthesis is created from phoneme, duration, andF0 symbol sequences. Preceding, current, and succeedingphonemes and F0 symbols are used as the contextual fac-tors of the acoustic features in the labels. We train con-text-dependent phone MSD-HMMs in advance using theinput speaker’s speech data and labels. We use STRAIGHTanalysis (Kawahara et al., 1999) and extract spectral enve-lope, F0, and aperiodicity features to extract the speech fea-tures. Then, the spectral envelope is converted to mel-cepstral coefficients using a recursion formula. The aperio-dicity feature is also converted to average values for five fre-quency sub-bands.

A sentence MSD-HMM is created from phone MSD-HMMs in accordance with the given label sequence, andmel-cepstrum, F0, and aperiodicity sequences are gener-ated from the sentence MSD-HMM using the HMM-basedparameter generation algorithm with the ML criterion(Tokuda et al., 1995a). Finally, the speech waveform is syn-thesized using STRAIGHT synthesis. By using the pho-netic and F0 symbols of the preceding and succeedingphones as a context in the synthesis labels, we can effi-ciently save the number of quantization bits compared tothe case where we use only the symbols for the currentphone.

2.5. Proposed coder

Fig. 5 illustrates a block diagram of the proposed pho-netic vocoder. In the encoder part, spectral and F0 feature

Fig. 5. Block diagram of the proposed phonetic vocoder.

Table 1Bit-rates [bits/sec] of three techniques.

Number of quantization bits

1 2 3 4 5

MSD-VQ 16 31 47 63 79Frame-based 200 400 600 800 1KMSD-HMM 16 31 47 63 79


sequences are extracted from the input speech. We useMFCCs for the spectral feature because they are widelyused in state-of-the-art automatic speech recognition sys-tems. A phoneme sequence with state durations is obtainedfrom the MFCC sequence of the input speech using a pho-neme recognizer with the input speaker’s pre-trainedHMMs. The F0 sequence is converted into an F0 symbolsequence using F0 quantization, which is described in Sec-tion 2.1. The obtained phonemes, F0 symbols, and statedurations are then transmitted to the decoder part. In thedecoder part, a phonetically and prosodically context-dependent label sequence is created from the phonemes,F0 symbols, and state durations. Finally, decoded speechis generated from the label sequence using the HMM-basedparameter generation algorithm in the decoder as describedin Section 2.4.

3. Experiments

3.1. Experimental conditions

We used reading style speech of three male (MHT,MSH, and MYI) and three female (FKN, FTK, andFYM) speakers from the ATR Japanese speech databaseset B in the following experiments. Each speaker uttered503 phonetically balanced Japanese sentences. We used450 sentences for model training and the remaining 53 sen-tences for evaluation. The average phoneme recognitionrate including insertion error for the test data of six speak-ers was 78.3%. The average phoneme rate computed fromthe recognition results for the test data of six speakerswas 15.7 phonemes/sec. Speech signals were sampled at arate of 16 kHz, and the interval of frame shift was 5-ms(200 frames/sec). The feature vector for the encoder con-sisted of 26 MFCCs including static and their delta param-eters, where the log energy was normalized. The featurevector of the decoder consisted of 39 mel-cepstral coeffi-cients including the zeroth coefficient, log F0, band aperio-dicity values of five sub-band, i.e., 0–1, 1–2, 2–4, 4–6, and6–8 kHz, and their delta and delta-delta coefficients. Thetotal number of dimensions was 138.

We used three-state left-to-right with no skip topologyboth in HMM and MSD-HMM. The output distributionin each state was modeled by a single Gaussian densityfunction, and the covariance matrices of these models wereassumed to be diagonal. We used questions for F0 symbolsin the clustering of the spectral features as well as the F0features. To reduce the perceptual degradation of syntheticspeech quality caused by over-smoothing of the spectralenvelope, we used a parameter generation algorithm con-sidering global variance (Toda and Tokuda, 2007) for thespectral features only in the subjective evaluation.

3.2. Objective evaluation

We first evaluated the performance of the proposed F0coding technique for different numbers of quantization bits

from 1 to 5. We used the root mean square (RMS) error oflog F0 between original and synthetic speech samples toobjectively measure F0 distortion. We also evaluated sim-ple frame-based coding and MSD-VQ for comparison.The frame-based coding was conducted using scalar quan-tization, where the same codewords were used as those inthe proposed technique. We smoothed the decoded F0 con-tour in the frame-based coding using the moving averagewith five preceding and succeeding frames to alleviate thediscontinuity between frames. The number of states inMSD-VQ was set to three, which was the same as thatfor MSD-HMM. To focus only on the F0 coding perfor-mance, the spectral sequence generated using the proposedtechnique was also used in the frame-based coding. The bit-rates for the three techniques with respective quantizationbits are listed in Table 1. The average bit-rate r [bits/sec]is given by

r ¼ n� rp; ð5Þ

where n is the number of quantization bits and rp is theaverage number of phonemes per second for all speakers.

Fig. 6(a) plots the average RMS error of log F0 for thesix speakers. When the number of quantization bits isone, two codewords are generated in MSD-VQ. On theother hand, one of the codewords for the proposed tech-nique is used as the unvoiced symbol and the actual numberof codewords becomes only one. This caused the larger F0distortion in MSD-HMM. However, the proposed codingtechnique gave smaller F0 distortion than MSD-VQ in allthe other numbers of bits. The distortion in the frame-basedcoding is smaller than that in the proposed technique whenthe number of quantization bits is more than two. However,it is noted that the total bit-rate in the frame-based coding ismuch higher than that in the proposed technique.

As well as the static characteristics of log F0, thedynamic characteristics could also affect the naturalnessof the synthetic speech. Hence, we calculated the RMSerror of the delta log F0 values between original anddecoded F0 values. The results are plotted in Fig. 6(b),where it is seen that the proposed technique outperformedMSD-VQ even when the number of quantization bits wasonly one. A possible reason for this is that we used preced-ing and succeeding phonetic and F0 symbols as the contextand this could be useful for predicting the dynamic charac-teristics of F0.

To confirm the effectiveness of contextual factors, wealso evaluated the coding performance when precedingand/or succeeding phonetic and prosodic contexts were

0

100

200

300

400

500

1 2 3 4 5

RM

S er

ror o

f log

F0

[cen

t]


MSD-VQFrame-based

MSD-HMM

(a) log F0

25

30

35

40

45

50

1 2 3 4 5

RM

S er

ror o

f log

F0

[cen

t]


MSD-VQFrame-based

MSD-HMM

(b) delta log F0

Fig. 6. RMS error of log F0 for different numbers of quantization bits.

0

100

200

300

400

500

1 2 3 4 5

RM

S er

ror o

f log

F0

[cen

t]


w/o both contextsw/o F0 context

w/o phonetic contextwith both contexts

(a) log F0

25

30

35

40

45

50

1 2 3 4 5

RM

S er

ror o

f log

F0

[cen

t]


w/o both contextsw/o F0 context

w/o phonetic contextwith both contexts

(b) delta log F0

Fig. 7. Effect of the preceding and succeeding phonetic and prosodic contexts.

4.7

4.8

4.9

5

5.1

5.2

1 2 3 4 5

Mel

-cep

stra

l dis

tanc

e [d

B]


MSHFYMMYI

FKN

MHT

FTK

Average

Fig. 8. Effect of the F0 context for spectral reproducibility of respectivespeakers.


not taken into account in the model training and speechcoding. Fig. 7(a) and (b) plot the RMS errors of the staticand delta log F0 values, respectively. There are four cases:(1) without both of the phonetic and F0 contexts, (2) with-out the phonetic context, (3) without the F0 context, (4)with both of the phonetic and F0 contexts. It is noted thatthe phonetic and F0 symbols for the current phone wereused in all cases. In Fig. 7(a), although the use of the F0context decreased the static log F0 distortion in one-bitF0 quantization, the effect of contexts disappeared whenthe number of quantization bits increased. On the otherhand, when neither phonetic nor F0 contexts was used,the delta log F0 distortion in Fig. 7(b) increased in everynumbers of quantization bits. From these results, the pre-ceding and succeeding F0 contextual factors are shown tobe effective to improve the coding performance. It is notedthat we cannot also ignore the phonetic context because thecontext is essential to the coding of spectral features. Con-sequently, we need to use both of the phonetic and F0 con-texts in the proposed technique. When comparing theresults of Figs. 6 and 7, we also found that there seems

to be more dominant factors to reduce RMS errors thancontexts. One of the possible factors are the parametertying structure. In the MSD-VQ, a codeword is assignedto the mean parameters of all states, and the codebook

1

2

3

4

5

1 2 3 4 5

DM

OS


MSD-VQFrame-based

MSD-HMM

(a) DMOS

1

2

3

4

5

1 2 3 4 5

MO

S


MSD-VQFrame-based

MSD-HMM

(b) MOS

Fig. 9. Results of subjective evaluation tests.

(a) two-bit quantization

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

MSD-VQ Time [sec]

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

MSD-VQ Time [sec]

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

MSD-HMM Time [sec]

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

MSD-HMM Time [sec]

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

Frame-based Time [sec]

50

100

150 200

0 1.0 2.0 3.0 4.0 5.0 6.0

F0 [H

z]

Frame-based Time [sec]

(b) four-bit quantization

Original F0 Decoded F0

Fig. 10. Example of decoded F0 contours with three techniques.


of each phoneme is trained independently for each other.This often causes discontinuous transition of F0 valuesamong phonemes, which results in the increase of the deltalog F0 distortion. On the other hand, the proposed tech-nique takes into account the phonetic and prosodic con-texts and conducts parameter tying among differentphonemes. This efficient parameter tying leads to robustparameter estimation, and the resultant F0 trajectorybecomes closer to the original than that of the MSD-VQ.

In the proposed technique, F0 symbols are also used asthe contextual factors for spectral modeling. This meansthat the F0 context could also affect the spectral reproduc-ibility. Fig. 8 shows the spectral distortion between originaland decoded speech samples for different numbers of F0quantization bits. In the figure, the results of respectivespeakers and their average are shown. From the results,we see that the effect of the F0 context depends on the tar-get speakers, and there is no significant effect in the averagescore.

3.3. Subjective evaluation

We conducted subjective evaluation with a degradationMOS (DMOS) test to evaluate the perceptual quality of F0coding. The synthetic speech samples were generated fromthe decoded F0 features with original spectral and aperio-dicity features using the three techniques evaluated in Sec-tion 3.2. This means that we focused on the degradation inF0 through speech coding in this experiment. Vocodedspeech was used as the reference. Seven participants evalu-ated the perceptual similarity of decoded speech samples toreference speech. Eight sentences were randomly chosen foreach participant from the 53 test sentences of six speakers,and participants rated degradation on a five-point scale,i.e., 1 for very annoying, 2 for annoying, 3 for slightlyannoying, 4 for audible but not annoying, and 5 for inau-dible. The scores are shown in Fig. 9(a) with confidenceintervals of 95%. It is clear from the figure that the pro-posed technique outperformed MSD-VQ for all numbers


bits. In addition, MSD-HMM had similar performance tothe frame-based F0 coding even though the bit-rate is muchlower than that for the frame-based.

Next, we evaluated the overall subjective quality of thephonetic vocoder using spectral and aperiodicity featuresgenerated from MSD-HMM. Seven participants evaluatedthe perceptual quality of decoded speech samples. Eightsentences were randomly chosen for each participant fromthe 53 test sentences of six speakers, and participants ratedthe speech quality on a five-point scale, i.e., 1 for bad, 2 forpoor, 3 for fair, 4 for good, and 5 for excellent. Fig. 9(b)shows the scores with confidence intervals of 95%. Wecan see that the quality of decoded speech of the proposedtechnique seems to be acceptable even when the number ofquantization bits is only two. Although there was a differ-ence between three- and four-bit quantization in theDMOS test for the MSD-HMM, there were not such differ-ences in the MOS test. A possible reason for this is that thepitch of decoded speech was slightly different from that ofthe original but it still sounded natural when the three-bitquantization was used. We also found that the MOS ofthe proposed technique was slightly better than theframe-based coding in the three- and four-bit F0 quantiza-tion though there were not significant differences betweenthe frame-based coding and the proposed technique inthe DMOS test. We confirmed that the differences in theMOS test were statistically significant at the 5% level inboth three- and four-bit quantization cases. One of the pos-sible reasons for the improvement is the simultaneous mod-eling of spectral and F0 features. In the frame-based F0coding, we used the spectral sequence generated using theproposed technique to focus only on the F0 coding perfor-mance. As is shown in the DMOS test, the frame-basedcoding with more than two-bit quantization gave compara-ble performance with the proposed technique when the ori-ginal spectral features were used in the decoding becausethere was no voiced/unvoiced (V/UV) mismatch betweenspectral and F0 features. However, when the generatedspectral features were used in the decoding, there weresome V/UV mismatches between generated spectral andoriginal F0, and this could degrade the scores of theMOS test. In contrast, the spectral and F0 features aresimultaneously modeled in the proposed technique, whichwould avoid the degradation.

3.4. Discussion

Fig. 10 shows examples of F0 contours generated withMSD-VQ, the frame-based coding, and MSD-HMM fordifferent numbers of quantization bits. The generated con-tour with MSD-HMM is smoother and more similar to theoriginal contour than those with MSD-VQ and the frame-based coding especially in two-bit quantization. The algo-rithm delay in the coding process with MSD-HMM isalmost the same as that in the conventional HMM-basedphonetic vocoder (Tokuda et al., 1998; Hoshiya et al.,2003). The delay in the encoder depends on the phoneme

recognizer, and it is generally on the order of 100 ms. Weneed the next phonetic and F0 symbols in the decoderand this is also on the order of 100 ms. The speech param-eter generation algorithm and STRAIGHT synthesis causeadditional delays. However, they would not be problemswhen the encoder/decoder was used for recording and playback purposes.

4. Conclusions

We proposed a technique of very low bit-rate F0 codingfor HMM-based phonetic vocoders. The proposed tech-nique utilizes the roughly quantized F0 symbols as the pro-sodic context for the HMM-based speech synthesis in thedecoding process. By taking account of preceding and suc-ceeding F0 symbols as contexts, the decoded F0 contour isvery smooth and is similar to that of the original even whenthe number of quantization bits is very small such as two orthree. Experimental results demonstrated that the proposedtechnique achieved approximately 47 bps for F0 codingwith acceptable speech quality. In future work, we intendto employ the speaker adaptation technique for speaker-independent F0 coding. It is also important to examinethe performance of the proposed F0 coding in noisy anderroneous conditions for the practical applications.

Acknowledgment

Part of this work was supported by JSPS Grant-in-Aidfor Scientific Research 21300063 and 21800020.

References

Dudley, H., 1958. Phonetic pattern recognition vocoder for narrow-bandspeech transmission. The Journal of the Acoustical Society of America30, 733–739.

Hoshiya, T., Sako, S., Zen, H., Tokuda, K., Masuko, T., Kobayashi, T.,Kitantura, T., 2003. Improving the performance of HMM-based verylow bit rate speech coding. In: Proceedings of the ICASSP 2003, pp.800–803.

Katsaggelos, A., Kondi, L., Meier, F., Ostermann, J., Schuster, G., 2002.MPEG-4 and rate-distortion-based shape-coding techniques. Proceed-ings of the IEEE, Special Issue Part Two: Multimedia SignalProcessing 86, 1126–1154.

Kawahara, H., Masuda-Katsuse, I., de Cheveigne, A., 1999. Restructuringspeech representations using a pitch-adaptive time-frequency smooth-ing and an instantaneous-frequency-based F0 extraction: Possible roleof a repetitive structure in sounds. Speech Communication 27, 187–207.

Lee, K., Cox, R., 2001. A very low bit rate speech coder based on arecognition/synthesis paradigm. IEEE Transactions on Speech andAudio Process. 9, 482–491.

Nose, T., Ooki, K., Kobayashi, T., 2010a. HMM-based speech synthesiswith unsupervised labeling of accentual context based on F0 quanti-zation and average voice model. In: Proceedings of the ICASSP 2010,pp. 4622–4625.

Nose, T., Ota, Y., Kobayashi, T., 2010b. HMM-based voice conversionusing quantized F0 context. IEICE Transactions on Information andSystems, 2483–2490.

Picone, J., Doddington, G., 1987. Low rate speech coding using contourquantization. In: Proceedings of the ICASSP’87, IEEE. pp. 1653–1656.


Picone, J., Doddington, G., 1989. A phonetic vocoder. In: Proceedings ofthe ICASSP’89, pp. 580–583.

Scheffers, M., 1988. Automatic stylization of F0-contours. In: Proceedingsof the 7th FASE symposium, pp. 981–984.

Schwartz, R., Klovstad, J., Makhoul, J., Sorensen, J., 1980. A preliminarydesign of a phonetic vocoder based on a diphone model. In:Proceedings of the ICASSP’80, IEEE. pp. 32–35.

Shinoda, K., Watanabe, T., 2000. MDL-based context-dependent sub-word modeling for speech recognition. Journal of the AcousticalSociety of Japan (E) 21, 79–86.

Soong, F., 1989. A phonetically labeled acoustic segment (PLAS)approach to speech analysis-synthesis. In: Proceedings of theICASSP’89, pp. 584–587.

Toda, T., Tokuda, K., 2007. A speech parameter generation algorithmconsidering global variance for HMM-based speech synthesis. IEICETransactions on Information and Systems E90-D, 816–824.

Tokuda, K., Kobayashi, T., Imai, S., 1995a. Speech parameter generationfrom HMM using dynamic features. In: Proceedings of the ICASSP-95, pp. 660–663.

Tokuda, K., Masuko, T., Yamada, T., Kobayashi, T., Imai, S., 1995b. Analgorithm for speech parameter generation from continuous mixtureHMMs with dynamic features. In: Proceedings of the Eurospeech, pp.757–760.

Tokuda, K., Masuko, T., Hiroi, J., Kobayashi, T., Kitamura, T., 1998. Avery low bit rate speech coder using HMM-based speech recognition/synthesis techniques. In: Proceedings of the ICASSP’98, pp. 609–612.

Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., 1999. HiddenMarkov models based on multi-space probability distribution for pitchpattern modeling. In: Proceedings of the ICASSP-99, pp. 229–232.

Documents

Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with quantized F0 symbols