
Available online at www.sciencedirect.com

www.elsevier.com/locate/specom

Speech Communication 51 (2009) 1169–1179

A Multi-Space Distribution (MSD) and two-stream tone modeling approach to Mandarin speech recognition

Yao Qian *, Frank K. Soong

Microsoft Research Asia, Beijing 100190, China

Received 12 May 2008; received in revised form 28 July 2009; accepted 5 August 2009

Abstract

Tone plays an important role in recognizing spoken tonal languages like Chinese. However, the discontinuity of F0 between voiced and unvoiced transitions has traditionally been a hurdle in creating a succinct statistical tone model for automatic speech recognition and synthesis. Various heuristic approaches have been proposed before to get around the problem but with limited success. The Multi-Space Distribution (MSD) proposed by Tokuda et al., which models the two probability spaces, discrete for the unvoiced region and continuous for the voiced F0 contour, in a linearly weighted mixture, has been successfully applied to Hidden Markov Model (HMM)-based text-to-speech synthesis. We extend MSD to Chinese Mandarin tone modeling for speech recognition. The tone features and spectral features are further separated into two streams and corresponding stream-dependent models are trained. Finally, two separate decision trees are constructed by clustering the corresponding stream-dependent HMMs. The MSD and two-stream modeling approach is evaluated on large vocabulary, continuously read and spontaneous speech Mandarin databases and its robustness is further investigated on a noisy, continuous Mandarin digit database with eight types of noises at five different SNRs. Experimental results show that our MSD and two-stream based tone modeling approach can significantly improve the recognition performance over a toneless baseline system. The relative tonal syllable error rate (TSER) reductions are 21.0%, 8.4% and 17.4% for the large vocabulary read, spontaneous and noisy digit speech recognition tasks, respectively. Compared with the conventional system where F0 contours are interpolated in unvoiced segments, our approach improves the recognition performance by 9.8%, 7.4% and 13.3% in relative TSER reductions on the corresponding speech recognition tasks, respectively.
© 2009 Elsevier B.V. All rights reserved.

Keywords: Tone model; Mandarin speech recognition; Multi-Space Distribution (MSD); Noisy digit recognition; LVCSR

1. Introduction

Mandarin as well as other Chinese dialects is known as a monosyllabically paced tonal language. Each Chinese character, which is the basic morphemic unit in written Chinese, is pronounced as a tonal syllable, i.e., a base syllable plus a lexical tone. All Mandarin syllables have a structure form of (consonant)–vowel–(consonant), where only the vowel nucleus is an obligatory element. If we consider only the phonemic composition of a syllable without tone, the syllable is referred to as a base syllable. Following

0167-6393/$ - see front matter © 2009 Elsevier B.V. All rights reserved.

doi:10.1016/j.specom.2009.08.001

* Corresponding author.
E-mail addresses: [email protected] (Y. Qian), [email protected] (F.K. Soong).

the convention of Chinese phonology, each base syllable is divided into two parts, namely Initial and Final. The Initial (onset) includes what precedes the vowels while the Final includes the vowel (nucleus) and what follows it (coda). Most Initials are unvoiced and thus the tones are carried primarily by the Finals. Proper tonal syllable recognition is critical to distinguishing homonyms of the same base syllables in applications where strong contextual information is not available in general, e.g. recognizing the name of a person or a place. A recognizer with high tonal syllable recognition accuracy has many useful applications, e.g. objective evaluation of the tonal language proficiency of a speaker.

It should be obvious that tone plays an important role in perceiving a Chinese tonal syllable. However, to construct a succinct tone model, which is critical for automatic tonal


syllable recognition, is not trivial. The discontinuity in the F0 contour between voiced and unvoiced regions has made the modeling difficult. Heuristic approaches like interpolating F0 in unvoiced segments to get around the discontinuity problem have been proposed (Hirst and Espesser, 1993; Chen et al., 1997; Chang et al., 2000; Freij and Fallside, 1988; Wang et al., 1997; Tian et al., 2004; Lei et al., 2006). The interpolated F0 can be generated from a quadratic spline function (Hirst and Espesser, 1993), an exponential decay function towards the running F0 average (Chen et al., 1997), or a probability density function (pdf) with a large variance (Chang et al., 2000; Freij and Fallside, 1988). These approaches are instrumentally effective since F0 information can be augmented as extra information with short-time spectral features frame synchronously. As a result, the concatenated spectral and pitch features are used frame synchronously in one-pass Viterbi decoding. However, the artificially interpolated F0 values do not reflect the actual tone, and the critical voicing/unvoicing information, which is in principle useful for recognizing phonetic units, is lost. Furthermore, in terms of corresponding time window size, the spectral (segmental) feature is distinctive in a phonetic or phone segment while the pitch (supra-segmental) feature is embedded in a longer time window of a word, a phrase or a sentence. By using a two-stream approach we can model spectral and pitch features more appropriately than with a single-stream one (Ho et al., 1999; Seide and Wang, 2000). There are many other approaches that model tone and spectral information separately (Qian et al., 2006; Lin et al., 1996; Peng and Wang, 2005; Zhang et al., 2005). The tone features are usually derived from syllables with force-aligned boundaries, and tone models are incorporated in a post-processing stage after the first decoding pass. A longer time window can then be used explicitly to take neighboring tone information into consideration (Qian et al., 2007). To integrate the tone information into the search process, rescoring of the lattices or N-best lists output from recognition is usually adopted.

In this paper, we adopt Multi-Space Distribution (MSD) based tone modeling in Mandarin speech recognition. The MSD was originally proposed by Tokuda et al. to model the discontinuous pitch contours in a statistical manner and was successfully applied to HMM-based speech synthesis (Tokuda et al., 2002). We extended the MSD model to speaker-independent Mandarin ("Putonghua") tone recognition (Wang et al., 2006; Qiang et al., 2007). The tone features and spectral features are further separated into two streams and stream-dependent models are built (clustered) in two separate decision trees. The MSD is seamlessly integrated into the HMM modeling process, which is the predominant technique for acoustic modeling in ASR training. The resultant model, the so-called MSD-HMM, is applied naturally to one-pass Viterbi decoding in continuous speech recognition. We test the effectiveness of the MSD approach on a large vocabulary, continuously read and spontaneous speech database and further evaluate its robustness on a noisy, continuous Mandarin digit database.

The rest of the paper is organized as follows. In Section 2, the MSD approach for Mandarin Chinese tone modeling is reviewed and its application to noisy speech recognition is investigated. The experimental results and analysis are shown in Section 3. In Section 4, we give our conclusion.

2. Mandarin speech recognition with MSD-based tone models

2.1. MSD for tone modeling

Multi-Space Distribution (MSD) was first proposed by Tokuda et al. to model stochastically the piece-wise continuous F0 trajectory and was applied to HMM-based speech synthesis (Tokuda et al., 2002). It assumes that the observation space $X$ of an event is made up of $G$ sub-spaces. Each sub-space $X_g$ has a prior probability $p(X_g)$ and all the priors sum up to one, $\sum_{g=1}^{G} p(X_g) = 1$. An observation vector, $o$, consists of a set of space indices $I$ and a random variable $x \in \mathbb{R}^n$, that is, $o = (I, x)$, and it is randomly distributed in each sub-space according to an underlying pdf, $p_g(V(o))$, where $V(o) = x$. The dimensionality of the observation vector can be different in different sub-spaces. The observation probability of $o$ is defined by

$$b(o) = \sum_{g \in S(o)} p(X_g)\, p_g(V(o)) \qquad (1)$$

where $S(o) = I$. The index set $I$ of the sub-spaces that observation $o$ belongs to is determined by the extracted features $x$ at each time instant of observation. A mixture of $K$ Gaussians can be seen as a special case of MSD, i.e., $K$ sub-spaces of MSD with the same dimensionality and a Gaussian distribution in each sub-space. The mixture weight associated with the $k$th Gaussian component, $c_k$, can be regarded as the prior probability of the $k$th sub-space, $c_k = p(X_k)$.
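As a concrete illustration of Eq. (1), the two-space voiced/unvoiced case can be sketched in a few lines of Python. This is a minimal, hypothetical example: the sub-space weights and Gaussian parameters are invented, and a real system would work in the log domain over full feature streams.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def msd_likelihood(obs, w_unvoiced, voiced_mixture):
    """MSD observation probability b(o) in the spirit of Eq. (1).

    obs: None for an unvoiced frame (zero-dimensional sub-space),
         or a float F0 value for a voiced frame.
    w_unvoiced: prior weight of the unvoiced sub-space.
    voiced_mixture: list of (weight, mean, var) for the voiced sub-spaces;
                    all sub-space weights are assumed to sum to one.
    """
    if obs is None:
        # Unvoiced: only the discrete sub-space contributes (delta pdf = 1).
        return w_unvoiced
    # Voiced: sum over the continuous one-dimensional Gaussian sub-spaces.
    return sum(w * gaussian_pdf(obs, m, v) for w, m, v in voiced_mixture)

# Hypothetical state: 0.3 unvoiced prior, two voiced Gaussians on log-F0.
mix = [(0.5, 5.0, 0.04), (0.2, 5.3, 0.09)]
print(msd_likelihood(None, 0.3, mix))   # unvoiced frame
print(msd_likelihood(5.1, 0.3, mix))    # voiced frame with log-F0 = 5.1
```

Note how the same function covers both frame types: the space index (voiced or unvoiced) selects which sub-spaces contribute to the sum.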

F0, the fundamental frequency or the pitch, is the most relevant feature used in recognizing tonal languages. But F0, a continuous variable, only exists in the voiced region of speech signals. In unvoiced segments, where no pitch harmonics exist, a discrete random variable is adequate to characterize the unvoicing property. Fig. 1 shows two tonal syllables "ti2 gan4" (the numerical labels denote their tone types: tone 2 and tone 4) in their triphone representation form; the F0 contours only span across the voiced segments in t-i2+g and g-an4+r. The conventional statistical model can only characterize a feature as either continuous or discrete but not both. Therefore, the discontinuity of F0 between voiced and unvoiced segments makes tone modeling difficult.

MSD is effective for characterizing the piece-wise continuous F0 contour without resorting to any unnecessary heuristic assumptions. In the voiced region, F0 is regarded as sequential, one-dimensional observations generated from several one-dimensional Gaussian sub-spaces, while in the unvoiced region, F0 is treated as a yes–no indicator-like, discrete symbol. We use Gaussian mixtures, the most commonly used form in speech recognition systems, for characterizing the output distributions. The MSD assumes that the output pdf of the zero-dimensional, unvoiced sub-space is a Kronecker delta function and that continuous F0 in the one-dimensional, voiced sub-space has a Gaussian mixture distribution.

Fig. 1. F0 contour of tonal syllable "ti2 gan4" and a schematic representation of using MSD for tone modeling.

Fig. 1 also gives a schematic representation of the MSD-based tone HMM. In the unvoiced part at the beginning, the consonant onset "t", the mixture weight which represents the unvoiced sub-space is close to one, while the weight summation of the Gaussian mixture components corresponding to the voiced sub-spaces is close to zero. At the voiced syllable Final "an4", the opposite is true. MSD tone modeling does not need any artificial contour interpolation of F0 trajectories. It models the original F0 features in a probabilistic manner and no hard decisions are needed.

2.2. Stream-dependent state tying

In LVCSR, context-dependent phone models, e.g., triphone models, are commonly used to capture the acoustic co-articulation between neighboring phones. To deal with the data sparseness problem of context-dependent phones in estimation, model parameters are usually tied together; e.g., state tying via decision-tree clustering is widely used in current LVCSR systems.

While spectral features like MFCC represent the vocal tract information, tone features reflect the vibration frequency, or absence of vibration, of the vocal cords. They can be modeled through two independent data streams. Moreover, the spectral (segmental) feature is distinctive in a phonetic or phone segment while the pitch (supra-segmental) feature is embedded in a longer time window of a syllable, a word, a phrase or even a sentence. The co-articulation effects of spectral and tone features, or their context dependencies, are different (Xu and Liu, 2006). Accordingly, it is more reasonable to perform state tying in the two streams separately. We design two question sets corresponding to tonal and phonetic context dependency, respectively. Then a decision-tree based clustering method is used to automatically find appropriate clusters for state tying. Each tonal syllable is divided into Initial and Tonal Final in the dictionary. We use separate decision trees built for each Initial and tonal Final.

An example of stream-dependent state tying based on decision-tree clustering is shown in Fig. 2, which illustrates the state tying process performed on state 2 of all triphones with central phone "y". Two decision trees, spectral and pitch trees, are grown for this state by using their own question sets. Going down from the top of the two trees, different questions are used to split the data samples (states). We find that the pitch feature stream is mainly concerned with questions of tonal context, while the questions for the spectral feature stream query more of the segmental context.
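The stream-dependent splitting idea can be illustrated with a toy sketch. The question sets and context labels below are invented for illustration; real systems grow the trees greedily by maximizing a likelihood gain over the tied states.

```python
# Toy illustration: the same triphone states are partitioned twice,
# once with a phonetic question (spectral stream) and once with a tonal
# question (pitch stream). Question sets here are made up.
def partition(states, question):
    """Split a list of context labels by a yes/no question."""
    yes = [s for s in states if question(s)]
    no = [s for s in states if not question(s)]
    return yes, no

# Context labels: (left-context phone, tone of current syllable)
states = [("t", 2), ("g", 4), ("sh", 1), ("m", 3)]

# Spectral-stream question: is the left context an unvoiced Initial?
spec_yes, spec_no = partition(states, lambda s: s[0] in {"t", "sh"})
# Pitch-stream question: is the current tone the falling tone (tone 4)?
pitch_yes, pitch_no = partition(states, lambda s: s[1] == 4)

print(spec_yes)   # states grouped together in the spectral tree
print(pitch_yes)  # states grouped together in the pitch tree
```

The point of the sketch is that the two streams end up with different partitions of the very same states, which is exactly why a single shared tree is suboptimal.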

2.3. MSD based tone modeling in noise

MSD is effective for modeling the piece-wise continuous F0 trajectory of a speech signal. But making the front-end feature extraction more robust, in terms of correct pitch estimates and proper voiced/unvoiced decisions, is also important for providing correct features at low SNRs. Fig. 3 shows the F0 contour, spectrogram and speech waveform of a digit sequence "9(jiu2) 9(jiu3) 3(san1)" corrupted by additive street noise at 5 dB SNR. Due to the noise contamination, the F0 contour of the second digit "9" (jiu3) cannot be


Fig. 2. An example of stream-dependent state tying based on decision-tree clustering.

Fig. 3. F0 contour, spectrogram and waveform of a digit sequence "9 9 3" in street noise at 5 dB SNR.

Fig. 4. F0 contours of a digit sequence "9 9 3" contaminated by street noise at 5 dB SNR, with/without F0 interpolation. [Axes: frame index vs. F0 (Hz).]

Fig. 5. F0 contours of a digit sequence "9 9 3" (clean) with/without F0 interpolation. [Axes: frame index vs. F0 (Hz).]


successfully extracted and the incorrect raw F0 values are erroneously interpolated. Such interpolation of F0 based upon the mis-tracked pitch can have a detrimental impact on the recognition performance. However, the MSD-based HMM, designed for modeling the piecewise continuous F0 contour stochastically, is more robust to noisy F0 features in the recognition process. Since voiced and unvoiced observations are evaluated with either a continuous Gaussian mixture or discrete probabilities, misdetection of pitch can still have negative but not disastrous effects on the likelihood computation. For example, the missing F0 in the second digit "9" is evaluated as a stochastic event with a lower probability in MSD. If the MSD model is trained with both clean and noisy data, it will be even more robust to pitch extraction errors.

A popular tone feature preprocessing employs a continuation algorithm (Chen et al., 1997) to interpolate the missing F0 values in unvoiced regions. The pitch is interpolated by running an exponential decay function towards the running average, plus a random noise. The target value of the exponential decay function is usually set to the first F0 value in the next voiced segment. The entire F0 contour is then smoothed with a low-pass filter after interpolation. Figs. 4 and 5 show the F0 contours of a digit sequence "993" in noisy and clean conditions, respectively, with/without F0 interpolation. The interpolated F0 values depend upon both the preceding and succeeding F0 values. Consequently, the interpolated F0 contour may deviate significantly from the true F0 values (if the frames are indeed in a voiced region but missed by the pitch tracking algorithm) or become artificial values in truly unvoiced regions. Furthermore, interpolating F0 values in a long unvoiced region can become difficult for real-time applications, i.e., getting the target F0 value from the next voiced region requires a long look-ahead.
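A rough sketch of such a continuation scheme is given below, assuming unvoiced frames are marked with zero F0. The decay constant is illustrative, the random-noise and low-pass smoothing steps are omitted, and the decay target is taken directly as the first F0 of the next voiced segment.

```python
def interpolate_f0(f0, decay=0.9):
    """Fill unvoiced frames (value 0) by exponential decay toward a target.

    Simplified sketch of the continuation idea described in the text:
    within an unvoiced gap, values decay from the last voiced F0 toward
    the first F0 of the next voiced segment. The decay constant is
    illustrative, not the published setting.
    """
    out = list(f0)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:  # find end of the unvoiced gap
                j += 1
            start = out[i - 1] if i > 0 else (out[j] if j < n else 0.0)
            target = out[j] if j < n else start  # next voiced value, if any
            cur = start
            for k in range(i, j):
                cur = target + decay * (cur - target)  # step toward target
                out[k] = cur
            i = j
        else:
            i += 1
    return out

track = [200.0, 210.0, 0, 0, 0, 180.0, 175.0]
print([round(v, 1) for v in interpolate_f0(track)])
```

Note the real-time caveat from the text: `target` comes from the *next* voiced segment, so a streaming recognizer would need to buffer the whole unvoiced gap before it can fill it.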

2.4. Tonal syllable recognition

The recognition process of tonal syllables can be rewritten as

$$\hat{M} = \arg\max_{M} \prod_{t} P(q_t \mid q_{t-1}) \left[\sum_{k} c^{s}_{kq_t}\, N\!\left(o^{s}_{t};\, \mu^{s}_{kq_t}, \Sigma^{s}_{kq_t}\right)\right] \cdot \left[\sum_{k} c^{p}_{kq_t}\, N\!\left(o^{p}_{t};\, \mu^{p}_{kq_t}, \Sigma^{p}_{kq_t}\right)\right] \qquad (2)$$

where $M$ represents a tonal syllable sequence; $q_t$ is the state at time $t$; and $o_t$ is divided into two streams: $o^{s}_{t}$ for the spectral feature and $o^{p}_{t}$ for the pitch feature. $\sum_{k} c^{s}_{kq_t} N(o^{s}_{t}; \mu^{s}_{kq_t}, \Sigma^{s}_{kq_t})$ is a mixture of Gaussians of the spectral features, where $c^{s}_{kq_t}$ is the $k$th mixture weight, while $\sum_{k} c^{p}_{kq_t} N(o^{p}_{t}; \mu^{p}_{kq_t}, \Sigma^{p}_{kq_t})$ is an MSD trained on the pitch-related features with a sub-space weight of $c^{p}_{kq_t}$. In the implementation, we still use a Gaussian mixture distribution to represent the MSD, where the output pdf of the zero-dimensional, unvoiced sub-space is assumed to be $N(o) = 1$, and the other, voiced sub-spaces have Gaussian distributions. At state $q_t$, spectral and pitch features access their own decision trees to fetch the corresponding parameters, but they share the same HMM state transition probability.
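The per-frame likelihood combination in Eq. (2) can be sketched as follows, with one-dimensional features per stream for brevity; all mixture weights and Gaussian parameters are illustrative, and a real decoder would add stream weights and state transition scores.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def stream_loglik_spectral(obs, mixture):
    """log of sum_k c_k N(obs; mu_k, var_k); 1-D here for brevity."""
    return math.log(sum(c * math.exp(log_gauss(obs, m, v)) for c, m, v in mixture))

def stream_loglik_pitch(obs, w_unvoiced, voiced_mixture):
    """log of the MSD pitch-stream term: discrete weight if unvoiced, GMM if voiced."""
    if obs is None:
        return math.log(w_unvoiced)
    return math.log(sum(c * math.exp(log_gauss(obs, m, v)) for c, m, v in voiced_mixture))

def frame_loglik(spec_obs, pitch_obs, spec_mix, w_unvoiced, pitch_mix):
    """Per-frame state log-likelihood: the product of the two stream
    terms in Eq. (2), i.e. a sum in the log domain."""
    return (stream_loglik_spectral(spec_obs, spec_mix)
            + stream_loglik_pitch(pitch_obs, w_unvoiced, pitch_mix))

spec_mix = [(0.6, 0.0, 1.0), (0.4, 1.0, 2.0)]   # illustrative spectral GMM
pitch_mix = [(0.7, 5.0, 0.05)]                  # illustrative voiced sub-space
print(frame_loglik(0.2, None, spec_mix, 0.3, pitch_mix))  # unvoiced frame
print(frame_loglik(0.2, 5.1, spec_mix, 0.3, pitch_mix))   # voiced frame
```

The two streams only meet in the final sum, which mirrors the sharing arrangement in the text: separate parameter trees per stream, one common state sequence.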

3. Experimental results and analysis

3.1. Speech databases

The recognition experiments are performed on a speaker-independent, large vocabulary, continuously read and spontaneous Mandarin Chinese speech database (BJ2003) and a noisy speech database of connected Chinese digits (CNDigits). The detailed descriptions of the databases are given in the following subsections.

3.1.1. Gender-dependent read and spontaneous speech database

There are a total of 490 speakers (gender balanced) in BJ2003. For each speaker, read speech and spontaneous speech utterances were recorded. The read speech part was collected by requesting the speaker to read through a set of Chinese texts, including modern novels and classical Chinese writings. For the spontaneous speech part, the speaker was requested to speak freely on a set of given topics, for example, "Describe your daily life in Beijing". The training data contains 50 h of read speech and 100 h of spontaneous speech from 230 male speakers and 230 female speakers. Four thousand utterances from the remaining 16 male and 14 female speakers are designated as testing data.

3.1.2. Noisy speech database

CNDigits consists of 8000 digit strings for training and 39,480 digit strings for testing. The training set consists of clean speech (1600 sentences) and four different kinds of noise: waiting room, street, bus and lounge, with four subsets for each type of noise; each subset contains 400 sentences from 120 female speakers and 200 male speakers, from 5 to 20 dB SNR, at a step of 5 dB. The testing set consists of the four noises (waiting room, street, bus and lounge) seen in the training set (matched noise) and four additional noises unseen in the training set: platform, shop, outside and exit (mismatched noise), with five subsets for each type of noise; each subset contains 987 sentences from 56 female speakers and 102 male speakers, from 0 to 20 dB SNR, at a step of 5 dB.

3.2. Experimental set-up

The configurations for experiments are listed as follows:

(1) MFCC-39: 39 MFCC features; one stream; baseline without F0 features.

(2) INTP-42-1S: 39 MFCC & F0 + ΔF0 + ΔΔF0; one stream; interpolated F0 used for unvoiced speech; HMM for F0 modeling; baseline with F0 features.

(3) INTP-42-2S: the same setting as INTP-42-1S except two streams.

(4) MSD-42-2S: 39 MFCC & F0 + ΔF0 + ΔΔF0; two streams; no F0 interpolation for unvoiced speech; MSD-HMM for F0 modeling.

(5) MSD-43-2S: MSD-42-2S + normalized duration.

(6) MSD-44-2S: MSD-43-2S + long-span pitch.

MFCC-39 is considered the baseline without F0 features. It employs the standard 39-dimension MFCC feature vectors. For INTP-42 and MSD-42, the feature vectors are appended with the instantaneous log F0 value and its first- and second-order derivatives. In INTP-42, the F0 features of unvoiced speech frames are obtained by interpolating with an exponentially decaying function (Chen et al., 1997), which is the conventional method of F0 interpolation for modeling. INTP-42-1S is the baseline with F0 features, where the tone features and spectral features are modeled in one stream and share the same decision trees, while in INTP-42-2S, the tone features and spectral features are separated into two streams and stream-dependent models are built (clustered) in two separate decision trees. For MSD-42-2S, no F0 value is assigned to an unvoiced frame and MSD is used to model such partially continuous feature parameters. MSD-43-2S and MSD-44-2S are the enhanced versions of MSD-42-2S, with the inclusion of a duration feature and a long-span feature parameter (Zhou et al., 2004). At each frame, the duration feature is computed as the interval length from the starting point (frame) of the current voiced segment to the present frame. It is normalized with respect to the average tone duration (Zhou et al., 2004). The long-span pitch is computed by normalizing the pitch value with respect to the average pitch over the last ten frames of the preceding voiced segment.
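The two enhanced features can be sketched as below, following the description above; the exact normalization in Zhou et al. (2004) may differ, and all frame indices and values here are illustrative.

```python
def duration_feature(frame_idx, segment_start, avg_tone_duration):
    """Frames elapsed in the current voiced segment, normalized by the
    average tone duration (both measured in frames)."""
    return (frame_idx - segment_start) / avg_tone_duration

def long_span_pitch(f0, prev_segment_f0):
    """Current pitch normalized by the mean pitch over the last ten
    frames of the preceding voiced segment."""
    tail = prev_segment_f0[-10:]
    return f0 / (sum(tail) / len(tail))

# Frame 57 of a voiced segment that started at frame 50,
# with an assumed average tone duration of 20 frames:
print(duration_feature(57, 50, 20.0))
# Current F0 of 210 Hz against a flat 200 Hz preceding segment:
print(long_span_pitch(210.0, [200.0] * 12))
```

Both features reach across voiced-segment boundaries, which is exactly why, as reported later for CNDigits, they inherit the errors of the voiced/unvoiced detector in noise.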

For each of the configurations, gender-dependent models for read and spontaneous speech are trained separately on the corresponding training data. In the baseline without F0 features, MFCC-39, all models are cross-word triphone HMMs with three emitting states. The phone set used is Phn187, which contains 187 phones (Initials and tonal Finals). A dictionary mapping tonal syllables to phones is used and no multiple pronunciation variants exist in the dictionary. Decision-tree based clustering is applied to context-dependent state tying and there are about 3000 tied states after clustering. Each state has 32 Gaussian mixture components. For the noisy speech database, a whole-word HMM was trained for each of the ten Chinese digits (from "0" to "9" as "ling2", "yi1 or yao1", "er4", "san1",


Table 1
The percentage of F0 tracking errors in vowel segments.

              Male (%)   Female (%)
Read          22.90      19.44
Spontaneous   24.90      21.53


"si4", "wu3", "liu4", "qi1", "ba1" and "jiu3"). Each model consists of 10 left-to-right states without skipping. Each state output pdf is a mixture of three diagonal-covariance Gaussians. Since we focus on evaluating the acoustic model performance, tonal syllables are used as the output of decoding, which itself has useful applications such as the Mandarin proficiency test (Zhang et al., 2006). An unconstrained free tonal-syllable loop grammar is used for decoding read and spontaneous speech. A free word (digit or tonal syllable) loop is employed in the noisy digit speech decoding.

3.3. The performance of F0 extraction

The main difference between MSD-HMM and the conventional interpolation method is that MSD-HMM can preserve the voiced/unvoiced information along with F0 in tone modeling. We evaluate the performance of F0 extraction, especially the voiced/unvoiced decisions, for noisy and spontaneous speech. Extraction of F0 is done on a short-time basis by applying the robust algorithm for pitch tracking (RAPT) (Talkin, 1995). The development of CNDigits is almost the same as that of Aurora 2 (Hirsch and Pearce, 2000). The eight types of noises are added to clean speech signals at five different SNRs, and each noisy digit string has a corresponding clean one. We use the F0s extracted from clean speech as the reference and compare the mismatch of unvoiced/voiced decisions between clean and noisy digit strings. The percentage of unvoiced/voiced errors in F0 extraction, averaged over matched and mismatched noises at all SNRs, is shown in Fig. 6, which indicates that the performance of F0 tracking degrades significantly with decreasing SNRs.
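The clean-versus-noisy voiced/unvoiced comparison can be sketched as follows, assuming frame-aligned pitch tracks in which a zero value marks an unvoiced decision (the tracks below are invented).

```python
def vuv_error_rate(clean_f0, noisy_f0):
    """Percentage of frames whose voiced/unvoiced decision (F0 > 0)
    differs between the clean reference and the noisy track."""
    assert len(clean_f0) == len(noisy_f0)
    swaps = sum((c > 0) != (n > 0) for c, n in zip(clean_f0, noisy_f0))
    return 100.0 * swaps / len(clean_f0)

# Toy example: frames 1 and 3 have swapped voicing decisions.
clean = [0, 0, 120.0, 125.0, 130.0, 0]
noisy = [0, 90.0, 118.0, 0, 131.0, 0]
print(vuv_error_rate(clean, noisy))
```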

For the BJ2003 database, no hand-marked reference can be used as the ground truth for evaluating F0 tracking performance. As a result, we assume that all frames of vowel segments, which are obtained from forced alignment, are voiced. The system used for forced alignment is our baseline without F0 features, MFCC-39. Table 1 lists the percentage of frames that fail in F0 tracking over all frames of vowel segments. The table shows that the pitch tracking performance of read speech is slightly better than that of spontaneous speech.

Fig. 6. The percentage of voiced/unvoiced swapping errors at different SNRs.

3.4. Experimental results

3.4.1. Gender-dependent read and spontaneous speech database

Table 2 shows the tonal-syllable error rates (TSER) attained by using different configurations on the BJ2003 database. MFCC-39 is used here as the baseline for both spontaneous and read speech. Two-stream tone modeling (INTP-42-2S) outperforms one stream (INTP-42-1S) by 0.25% and 1.4% in absolute TSER reduction for spontaneous and read speech recognition, respectively. For read speech, MSD-42-2S improves the TSER performance from 46.80% and 45.24% to 39.44% and 36.61%, i.e., relative TSER reductions of 15.73% and 19.08%, for the male and female parts of the database, respectively. It slightly outperforms INTP-42-2S. For spontaneous speech, the effectiveness of MSD-42-2S is more prominent than that of INTP-42-2S. MSD-42-2S reduces the absolute TSER by 3.56% and 4.03% for male and female speech, respectively, while only 0.3% and 1.62% corresponding TSER reductions are obtained by INTP-42-2S. The duration feature (MSD-43-2S) and the long-span pitch feature (MSD-44-2S) can further improve the TSER of both read and spontaneous speech. Fig. 7 shows the average recognition performance measured in relative TSER reduction, compared with MFCC-39, for five different feature configurations: INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S on BJ2003. The maximum improvement of 23.14% in relative TSER reduction is obtained for read female speech. It also shows that the effectiveness of F0 features in improving recognition of spontaneous speech is less than for read speech. This may be due to the low baseline recognition performance on spontaneous speech.

3.4.2. Noisy speech database

The baseline recognition performance (MFCC-39) in matched and mismatched noise conditions at various SNRs is shown in Table 3. It shows that the recognition performance degrades with decreasing SNRs. The baseline system achieves an average 4.1% word (digit or tonal syllable) error rate (WER) in the clean condition.

Fig. 8 shows the average relative WER reductions, averaged over all noise conditions, of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S, compared with MFCC-39. Among all five configurations, MSD-42-2S achieves the best performance. It yields 19.54% and 15.39% relative WER reductions averaged over all SNRs for matched and mismatched noises, respectively. INTP-42 can also improve recognition performance over the


Table 2
Tonal-syllable error rate (TSER) of the six configurations for BJ2003.

             Spontaneous (%)      Read (%)
             Male     Female      Male     Female
MFCC-39      69.01    60.55       46.80    45.24
INTP-42-1S   68.87    59.26       41.32    39.28
INTP-42-2S   68.71    58.93       39.94    37.86
MSD-42-2S    65.45    56.52       39.44    36.61
MSD-43-2S    64.79    55.91       39.00    35.59
MSD-44-2S    63.88    54.83       37.98    34.77


baseline but in a more limited way, compared with MSD-42, and the two-stream based tone models (INTP-42-2S) only slightly outperform one-stream modeling (INTP-42-1S). Incorporating more F0-related features, MSD-43-2S and MSD-44-2S did not improve the recognition performance further. This may be due to the fact that the pitch extraction module fails to detect voiced/unvoiced boundaries properly in noisy conditions. The duration and long-span pitch features are less reliable than the instantaneous F0 feature. Therefore, the maximum improvement for the CNDigits corpus is obtained from MSD tone modeling. To compare the performance of MSD-HMM with interpolated-F0 based HMM in noisy conditions, we list the detailed numbers at various SNRs in matched and mismatched noise conditions in Tables 4 and 5.

Fig. 7. The average relative TSER reduction of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S for BJ2003, comparing with MFCC-39.

Fig. 8. The average relative WER reduction of INTP-42-1S, INTP-42-2S, MSD-42-2S, MSD-43-2S and MSD-44-2S for CNDigits, comparing with MFCC-39.

Table 3
Word error rate (WER) using MFCC-39 features on the CNDigits corpus.

        Matched noise (%)                       Mismatched noise (%)
        Waiting room  Street  Bus   Lounge     Platform  Shop   Outside  Exit
20 dB   3.86          4.37    3.64  4.24       3.62      4.55   3.63     4.04
15 dB   4.83          4.59    3.69  4.65       4.00      5.36   4.26     4.52
10 dB   8.79          5.41    3.96  5.73       5.07      8.71   5.19     5.41
5 dB    17.39         6.87    4.57  9.17       8.59      16.83  7.58     7.49
0 dB    32.53         11.07   6.23  19.05      17.93     34.48  15.12    14.49

The breakdown of the recognition performance of INTP-42-2S and MSD-42-2S in clean and in noisy conditions at SNRs from 20 down to 0 dB is illustrated in Figs. 9 and 10. Fig. 9 shows that MSD-based tone modeling can significantly improve noisy Chinese digit recognition performance at SNRs from 20 to 5 dB. The maximum


Table 4
WER of INTP-42-2S for CNDigits.

        Matched noise (%)                       Mismatched noise (%)
        Waiting room  Street  Bus   Lounge     Platform  Shop   Outside  Exit
20 dB   3.86          3.08    2.50  2.98       2.75      3.46   2.59     2.94
15 dB   5.37          3.40    2.44  3.38       3.31      4.33   3.21     3.14
10 dB   8.39          4.69    2.61  4.70       5.18      7.62   5.09     4.42
5 dB    18.29         8.83    3.46  10.45      11.81     15.88  11.77    8.49
0 dB    38.99         20.55   5.12  27.21      27.54     39.09  26.48    22.04

Table 5
WER of MSD-42-2S for CNDigits.

        Matched noise (%)                       Mismatched noise (%)
        Waiting room  Street  Bus   Lounge     Platform  Shop   Outside  Exit
20 dB   2.81          3.38    2.65  3.19       2.61      3.38   2.63     3.10
15 dB   4.33          3.60    2.71  3.48       3.45      4.19   3.03     3.29
10 dB   6.73          4.60    2.92  4.26       4.57      7.17   4.71     4.13
5 dB    14.28         6.70    3.47  7.94       8.41      14.22  8.21     6.54
0 dB    30.38         13.66   4.59  18.87      19.22     32.15  17.43    14.97

Fig. 9. Relative WER reduction of MSD-42-2S for CNDigits comparing with MFCC-39.

Fig. 10. Relative WER reduction of INTP-42-2S for CNDigits comparing with MFCC-39.

Fig. 11. Mean of the mixture weight for the unvoiced sub-space in the states of unvoiced and voiced phone models. [Axes: state index (s1–s3) for unvoiced and voiced phone models vs. mixture weight for the unvoiced sub-space.]


improvement of 26.01% in relative WER reduction is obtained at 20 dB SNR, averaged over all mismatched noise conditions. Fig. 9 also shows that the performance improvements at SNRs of 20 and 10 dB are close to that of clean speech. However, at 0 dB SNR, which is not included in the training data, the recognition performance is worse than that of MFCC-39 in mismatched noise conditions. This may be due to the fact that at such a low SNR, the pitch extraction module fails to track the F0 contour. Fig. 10 shows that the recognition performance of F0 interpolation is much worse than the baseline at low SNRs, e.g. 5 and 0 dB. It indicates that the interpolation method suffers more recognition performance loss from deteriorated pitch estimates at those two SNRs.

3.5. Results analysis and discussion

We analyze the mixture weights of the unvoiced sub-space in the states of unvoiced and voiced phones for the MSD-HMMs trained on male read speech. Their mean values are given in Fig. 11, in which we find that the weights of state 1 and state 3 in the unvoiced phone models are lower than that of state 2, while the opposite pattern is observed in the voiced phone models. We believe this is because state 1 and state 3 lie in a


Table 6
BSER (with TER in parentheses) of the three configurations MFCC-39, INTP-42-2S and MSD-42-2S for BJ 2003.

BSER (TER)    Spontaneous (%)                  Read (%)
              Male           Female            Male           Female
MFCC-39       53.02 (49.09)  42.80 (44.49)     27.24 (34.42)  25.50 (32.40)
INTP-42-2S    54.82 (47.16)  42.81 (41.40)     26.12 (25.79)  24.00 (24.14)
MSD-42-2S     50.94 (44.46)  40.00 (40.09)     24.19 (26.34)  21.39 (23.87)


transition region between unvoiced and voiced segments and are therefore less distinct than the central state in terms of their voiced/unvoiced characteristics.
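This voiced/unvoiced mixing is explicit in the MSD output probability (Tokuda et al., 2002): each F0-stream state linearly weights a discrete unvoiced sub-space against continuous voiced Gaussians. The following is a minimal sketch, not the full MSD formulation — the function name and the simplification to a single voiced Gaussian are ours:

```python
import math

def msd_state_likelihood(f0, w_unvoiced, mu, var):
    """MSD output probability of one F0-stream state (illustrative sketch).

    f0 is None for an unvoiced frame (the discrete, zero-dimensional
    sub-space) or a real F0 value for a voiced frame.  w_unvoiced is the
    state's mixture weight for the unvoiced sub-space, as analyzed in
    Fig. 11; a single Gaussian (mu, var) stands in for the voiced mixture.
    """
    if f0 is None:                 # unvoiced observation: discrete sub-space
        return w_unvoiced
    w_voiced = 1.0 - w_unvoiced    # sub-space weights sum to one
    return w_voiced * math.exp(-0.5 * (f0 - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
```

A state whose training frames are mostly unvoiced ends up with a large w_unvoiced, which is exactly the pattern Fig. 11 shows for the central state of the unvoiced phone models.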

In Table 2, we notice that MSD-42-2S performs much better on spontaneous speech than on read speech when compared with INTP-42-2S. We further analyze the base syllable error rate (BSER; the syllable error rate ignoring tone labels) and the tone error rate (TER) of the three configurations for BJ 2003, as shown in Table 6. Both INTP-42-2S and MSD-42-2S significantly improve TER over the baseline MFCC-39 system for spontaneous and read speech recognition. However, INTP-42-2S worsens the BSER of spontaneous speech, from the baseline 53.02% to 54.82% in the case of male speech. The high speaking rate, complex co-articulation patterns and pronunciation variation largely degrade recognition performance for spontaneous speech (Shinozaki et al., 2001; Fosler-Lussier and Morgan, 1999). A Chinese syllable has an Initial–Final structure, and most Initials are unvoiced. MSD-based F0 models naturally preserve voiced/unvoiced information, which can indicate syllable boundaries and hence assist syllable recognition, especially when the spectral models do not fit the testing data well.

The recognition error patterns, or confusion matrices, generated by interpolation- and MSD-based F0 modeling in lounge noise at 5 dB SNR are compared in Tables 7 and 8, for INTP-42-2S and MSD-42-2S, respectively. MSD significantly reduces digit deletion errors, from 6.9% to 4.5%. The majority of deletion errors are associated with the semi-vowel, low-pitch digit "5 (wu3)". The digit "5" is frequently deleted because it is not well separated from preceding or succeeding digits, i.e., it has no unvoiced consonants to "protect" it from being merged with adjacent digits.

Table 7
The confusion matrix of the recognition result using INTP-42-2S under 5 dB SNR, lounge noise.

      ling  yao  yi   er   san  si   wu   liu  qi   ba   jiu  Del
ling  708   1    15   0    0    1    6    3    2    0    0    55
yao   0     66   0    1    0    0    2    0    0    2    0    0
yi    2     3    610  0    0    1    0    2    10   1    0    77
er    0     2    0    687  1    1    3    3    0    20   0    37
san   0     0    1    1    838  17   0    0    0    2    0    11
si    0     0    0    1    4    730  0    0    2    0    0    9
wu    1     0    2    1    0    4    539  0    2    0    1    290
liu   15    2    2    1    0    0    5    683  1    0    6    13
qi    1     0    6    0    0    7    0    0    740  0    2    6
ba    0     0    0    6    1    1    0    0    0    821  0    9
jiu   5     3    0    1    1    0    5    6    23   0    709  36
Ins   1     0    9    2    2    16   24   1    6    0    0    –

Table 8
The confusion matrix of the recognition result using MSD-42-2S under 5 dB SNR, lounge noise.

      ling  yao  yi   er   san  si   wu   liu  qi   ba   jiu  Del
ling  730   0    15   0    0    0    8    7    1    0    1    29
yao   0     62   0    4    0    0    1    3    0    1    0    0
yi    3     2    627  0    0    4    0    2    7    0    0    61
er    0     2    0    695  1    1    1    2    0    15   0    37
san   0     0    1    0    846  17   0    0    0    2    0    4
si    0     0    1    0    4    736  0    0    2    0    0    3
wu    1     0    1    0    1    5    648  0    1    0    2    181
liu   13    2    1    1    0    0    6    692  1    0    1    11
qi    0     0    6    0    0    9    0    0    743  0    1    3
ba    0     0    0    4    1    0    0    0    0    820  0    13
jiu   3     1    0    0    1    2    2    7    19   0    738  16
Ins   4     0    8    3    2    22   22   0    3    0    0    –

In this paper, we mainly focus on tonal syllable recognition of Chinese speech for applications such as the Mandarin proficiency test. The performance of Chinese LVCSR is usually measured by the Chinese character error rate (CER), since the definition of a word in Chinese is
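The deletion percentages quoted above can be recovered from the Del columns of Tables 7 and 8. A small sketch (the totals are our own sums over the reference rows of the tables; "Ins" rows are excluded from the reference count):

```python
def deletion_rate(total_deletions, total_reference_digits):
    """Deletions as a percentage of all reference digits."""
    return 100.0 * total_deletions / total_reference_digits

# Summing the Del column over the reference rows gives 543 deletions for
# INTP-42-2S (Table 7) and 358 for MSD-42-2S (Table 8), out of 7895
# reference digits in either table.
print(round(deletion_rate(543, 7895), 1))   # 6.9
print(round(deletion_rate(358, 7895), 1))   # 4.5
```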


Table 9
CER of MFCC-39, INTP-42-1S and MSD-44-2S with bi-gram and tri-gram LMs for the male and female read speech subsets of BJ 2003.

              Bi-gram (%)          Tri-gram (%)
              Male     Female      Male     Female
MFCC-39       16.91    12.85       12.09    8.71
INTP-42-1S    15.00    11.44       11.15    7.79
MSD-44-2S     14.03    10.56       10.55    7.21


somewhat vague. We integrate a language model (LM) with a 60k-word dictionary into the decoding process to test whether our MSD-HMM and two-stream tone modeling approach remains effective for general LVCSR. The dictionary words have an average of 1.1 pronunciations per word and an average length of 2.3 Mandarin characters. The LM was trained on a large text database (1714M Mandarin words) including news, novels, poems, and data collected from the World Wide Web. The word probabilities are smoothed by Good-Turing discounting and back-off smoothing. The perplexities of the bi-gram and tri-gram LMs used in decoding are 398 and 280, respectively. The bi-gram LM is employed for the first-pass search, while the tri-gram LM is used for rescoring the lattice generated in the first pass. Both bi-gram and tri-gram LMs are applied to the gender-dependent read speech database. The recognition results show that MSD-HMM and two-stream tone modeling (MSD-44-2S) outperforms the baseline without F0 features (MFCC-39) by 2.6% and 1.5% in CER with bi-gram and tri-gram LMs, respectively. Compared with the conventional interpolation baseline system (INTP-42-1S), our system improves the CER from 13.2% and 9.5% to 12.3% and 8.9%, respectively. The breakdown of CER for the male and female read speech subsets of BJ 2003 is shown in Table 9. Since it is not trivial to find proper LMs for spontaneous speech and noisy digit speech recognition, we did not test their CER performance using LMs.
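The quoted perplexities (398 for the bi-gram and 280 for the tri-gram LM) are per-word perplexities over the test text. As a reminder of the definition, a small sketch (the function name is ours):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-word natural-log probabilities assigned by an LM.

    PPL = exp(-(1/N) * sum(log p(w_i | history))); a lower value means
    the LM finds the text more predictable.
    """
    return math.exp(-sum(logprobs) / len(logprobs))

# Sanity check: an LM assigning every word probability 1/398 has
# perplexity exactly 398.
print(perplexity([math.log(1.0 / 398.0)] * 7))
```

On this scale, rescoring with the tri-gram LM (perplexity 280) gives the decoder a measurably more predictive prior than the bi-gram first pass (perplexity 398), consistent with the CER gains in Table 9.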

4. Conclusions and future work

We propose the use of MSD and two-stream modeling for tone and apply them to speech recognition of tonal languages. The approach is highly effective in modeling the semi-continuous F0 trajectory, and it is distinctive in: (1) modeling the original F0 features without artificially interpolating the discontinuous unvoiced regions; (2) modeling tone and spectral features in two separate streams with stream-dependent state tying. The MSD-HMM, two-stream approach achieves a significant improvement in tonal syllable recognition on large vocabulary, continuously read and spontaneous Mandarin speech and on noisy, continuous Mandarin digit speech. MSD-HMM tone modeling, which captures instantaneous F0 information, is seamlessly integrated into the one-pass Viterbi decoding of continuous speech recognition. F0 information over a horizontal, longer time span can also be used to build explicit tone models for rescoring the decoding lattice in a second-pass search. This can further improve tonal syllable recognition performance beyond that of MSD-HMM, as demonstrated in our experiments on continuous read speech (Wang et al., 2006).

Acknowledgements

The authors appreciate the help of Prof. Keiichi Tokuda and Dr. Heiga Zen, Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan, for providing the MSD training tool in the HTS software via the website: http://hts.ics.nitech.ac.jp/. They also want to thank Yuting Yeung, Sheng Qiang and Huanliang Wang for their contributions to this research during their internships at Microsoft Research Asia.

References

Chang, E., Zhou, J.L., Di, S., Huang, C., Lee, K.-F., 2000. Large vocabulary Mandarin speech recognition with different approaches in modeling tones. In: Proc. ICSLP 2000, pp. 983–986.

Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous Mandarin speech recognition. In: Proc. Eurospeech 1997, pp. 1543–1546.

Fosler-Lussier, E., Morgan, N., 1999. Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Comm. 29, 137–158.

Freij, G.J., Fallside, F., 1988. Lexical stress recognition using hidden Markov models. In: Proc. ICASSP 1988, pp. 135–138.

Hirsch, H.G., Pearce, D., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ISCA ITRW ASR.

Hirst, D., Espesser, R., 1993. Automatic modeling of fundamental frequency using a quadratic spline function. Travaux de l'Institut de Phonetique d'Aix 15, 71–85.

Ho, T.H., Liu, C.J., Sun, H., Tsai, M.Y., Lee, L.S., 1999. Phonetic state tied-mixture tone modeling for large vocabulary continuous Mandarin speech recognition. In: Proc. EuroSpeech 1999, pp. 883–886.

Lei, X., Siu, M., Hwang, M.-Y., Ostendorf, M., Lee, T., 2006. Improved tone modeling for Mandarin broadcast news speech recognition. In: Proc. InterSpeech 2006, pp. 1237–1240.

Lin, C.H., Wu, C.H., Ting, P.Y., Wang, H.M., 1996. Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. Speech Comm. 18 (2), 175–190.

Peng, G., Wang, W.S.-Y., 2005. Tone recognition of continuous Cantonese speech based on support vector machines. Speech Comm. 45, 49–62.

Qian, Y., Soong, F.K., Lee, T., 2006. Tone-enhanced generalized character posterior probability (GCPP) for Cantonese LVCSR. In: Proc. ICASSP 2006, Vol. 1, pp. 133–136.

Qian, Y., Lee, T., Soong, F.K., 2007. Tone recognition in continuous Cantonese speech using supratone models. J. Acoust. Soc. Amer. 121 (5), 2936–2945.

Qiang, S., Qian, Y., Soong, F.K., Xu, C.-F., 2007. Robust F0 modeling for Mandarin speech recognition in noise. In: Proc. InterSpeech 2007, pp. 1801–1804.

Seide, F., Wang, N.J.C., 2000. Two-stream modeling of Mandarin tones. In: Proc. ICSLP 2000, pp. 495–498.

Shinozaki, T., Hori, C., Furui, S., 2001. Towards automatic transcription of spontaneous presentations. In: Proc. Eurospeech 2001, pp. 491–494.

Talkin, D., 1995. A robust algorithm for pitch tracking (RAPT). In: Speech Coding and Synthesis. Elsevier Science B.V., Amsterdam, pp. 495–518.

Tian, Y., Zhou, J.-L., Chu, M., Chang, E., 2004. Tone recognition with fractionized models and outlined features. In: Proc. ICASSP 2004, Vol. 1, pp. 105–108.

Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., 2002. Multi-space probability distribution HMM. IEICE Trans. Inf. Systems E85-D (3), 455–464.

Wang, H.M., Ho, T.H., Yang, R.C., Shen, J.L., Bai, B.R., Hong, J.C., Chen, W.P., Yu, T.L., Lee, L.-S., 1997. Recognition of continuous Mandarin speech for Chinese language with very large vocabulary using limited training data. IEEE Trans. Speech Audio Process. 5, 195–200.

Wang, H.L., Qian, Y., Soong, F.K., Zhou, J.-L., Han, J.Q., 2006. A multi-space distribution (MSD) approach to speech recognition of tonal languages. In: Proc. ICSLP 2006, pp. 1047–1050.

Wang, H.L., Qian, Y., Soong, F.K., Zhou, J.-L., Han, J.-Q., 2006. Improved Mandarin speech recognition by lattice rescoring with enhanced tone models. In: Proc. ISCSLP, Springer LNAI 4274, pp. 445–453.

Xu, Y., Liu, F., 2006. Tonal alignment, syllable structure and coarticulation: toward an integrated model. Italian J. Linguist. 18, 125–159.

Zhang, J.-S., Nakamura, S., Hirose, K., 2005. Tone nucleus-based multi-level robust acoustic tonal modeling of sentential F0 variations for Chinese continuous speech tone recognition. Speech Comm. 46, 440–454.

Zhang, L., Huang, C., Chu, M., Soong, F.K., Zhang, X., Chen, Y., 2006. Automatic detection of tone mispronunciation in Mandarin. In: Proc. ISCSLP, Springer LNAI 4274, pp. 590–601.

Zhou, J.-L., Tian, Y., Shi, Y., Huang, C., Chang, E., 2004. Tone articulation modeling for Mandarin spontaneous speech recognition. In: Proc. ICASSP 2004, pp. 997–1000.