
[IEEE 2012 5th International Congress on Image and Signal Processing (CISP) - Chongqing, Sichuan, China (2012.10.16-2012.10.18)]


978-1-4673-0964-6/12/$31.00 ©2012 IEEE 1592

2012 5th International Congress on Image and Signal Processing (CISP 2012)

Lattice Generation With Accurate Word Boundary in WFST Framework

Yuhong Guo, Yujing Si, Yong Liu, Jielin Pan, Yonghong Yan
Key Laboratory of Speech Acoustics and Content Understanding,
Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China.

Email: {guoyuhong, siyujing, liuyong, jpan, yyan}@hccl.ioa.ac.cn

Abstract—This paper presents an algorithm to generate a speech recognition lattice with accurate word boundaries in the weighted finite-state transducer (WFST) decoding framework. In traditional WFST lattice generation algorithms, the transformation from the context-dependent phone lattice to the word lattice does not yield accurate time boundaries between words. Moreover, the resulting lattice is not in the Standard Lattice Format and is not compatible with existing toolkits, so it can only be used where word boundaries are not needed. In this paper, we propose a lexicon matching algorithm based on token passing to transform the phone lattice into the word lattice. The algorithm generates standard lattices with accurate word boundaries. Experiments show that the proposed lattice generation algorithm achieves good lattice quality and good efficiency.

Index Terms—lattice generation, speech recognition, weighted finite-state transducers

I. INTRODUCTION

In recent years, large vocabulary continuous speech recognition (LVCSR) has been applied in many areas, including dictation systems, voice search, voice input systems, spoken term detection, and spoken dialogue systems. In many cases, such as voice search and spoken term detection, it is desirable to provide not only the single best result but also several alternative choices. Multiple results also create an opportunity for user-machine interaction and a more user-friendly environment. In other cases, it is too expensive to use a large language model (LM) or a large acoustic model (AM) in the first decoding pass, so it is often desirable to divide the decoding procedure into two passes. The speech recognition lattice is a weighted, labeled, directed acyclic graph generated during decoding to represent the alternative decoding hypotheses; it is also the bridge between the first and the second pass. In a typical two-pass routine, the first pass generates lattices with relatively simple AMs and LMs, and the second pass rescores the lattices with more elaborate models.

There are two conventional types of decoding network representations: one is based on word-conditioned tree search (WCTS) and the other on the weighted finite-state transducer (WFST) [1]. Recently, the WFST has become the main representation of decoder networks. It offers a unified framework for all the knowledge sources, such as hidden Markov models (HMMs), context-dependent phoneme models, lexical descriptions, and n-gram language models. All these knowledge sources can be integrated into a single, fully optimized WFST network through a series of optimization algorithms. However, to fully satisfy the optimization requirements, the WFST network has no explicit word-end boundary. Therefore, the word lattices generated by WFST decoders have no accurate word boundaries [2]. Such lattices cannot be applied in situations where accurate word boundaries are required, such as spoken term detection and spoken document retrieval. Although word boundaries could be preserved by applying a second pass or an acoustic forced alignment after WFST decoding, the additional computational cost would be too heavy. Another paper [3] describes a method which inserts an extra word-end transition for each word; this technique increases the final network size and leaves the optimization operations incomplete. In [4], an optimized lattice generation algorithm is also introduced, but the lattice is an HMM-state level lattice and the word boundaries are still not accurate.

In this article, we propose a lattice generation algorithm which retains accurate word boundaries in the WFST decoding framework. The algorithm uses a lexicon matching method based on token passing to generate the final word lattice. Redundancy control is also studied: the experimental results show that it significantly reduces redundant computation and improves the algorithm's efficiency. The experiments also show that the generated lattice has good quality.

The paper is organized as follows. Section II gives a short review of the basics of lattice generation in WFST decoders. Section III proposes the exact word lattice generation algorithm. Section IV presents the experimental results and analysis. Finally, Section V concludes the article.

II. BACKGROUND

A. WFST decoding framework

A WFST is a weighted finite automaton whose transitions (or edges) are labeled with both input and output symbols [5]. This representation has both acceptor and transducer properties: a path through the network (i.e., accepted by the automaton) also maps the input symbol sequence to an output sequence, and the path weight represents the


probabilities or penalties [6]. The WFST provides a generic and well-defined framework to represent the knowledge sources. Moreover, a set of classic operations, such as composition, determinization, minimization, and weight pushing, has been developed to compose all the knowledge sources and optimize the final WFST network. The construction of a final context-dependent phone-level (C-level) WFST network can be expressed as:

F = πϵ(min(det(C̃ ◦ det(L̃ ◦ G̃)))), (1)

where C̃, L̃, G̃ are, respectively, the WFST representations of the context-dependent phonemes, the lexicon and the n-gram LM with auxiliary symbols. The auxiliary symbols guarantee the determinizability of all the knowledge sources. The operators ◦, min and det denote the WFST composition, minimization and determinization operations [7], [8]. Finally, πϵ replaces all the auxiliary symbols with epsilon transitions. All these operations are semiring based; in the LVCSR case, the tropical semiring and the log semiring are most often used [5]. The tropical semiring is derived from the log semiring via the Viterbi approximation [6], which has lower computation cost but also lower accuracy. To achieve higher accuracy, the log semiring is used in this article unless otherwise stated.
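The two semirings differ only in their ⊕ operation over negative-log-probability costs; ⊗ (path extension) is ordinary cost addition in both. A minimal sketch (illustrative, not from the paper; function names are our own):

```python
import math

def log_plus(a, b):
    """Log semiring oplus: -log(e^-a + e^-b) for costs a, b."""
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def tropical_plus(a, b):
    """Tropical semiring oplus (Viterbi approximation): keep the best cost."""
    return min(a, b)

def times(a, b):
    """otimes in both semirings: costs add along a path."""
    return a + b
```

Combining two equal-cost paths lowers the total cost by log 2 in the log semiring, while the tropical semiring simply discards the second path; this is the accuracy difference noted above.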

B. Phone lattice recording

The decoding procedure finds the path through the final C-level WFST network with the maximum probability for a given utterance; the Viterbi beam search is applied. A phone lattice is recorded during decoding. Each state of the phone lattice corresponds to a pair (t, s) of a time frame in the recognition and a network state of the decoding C-level WFST. Each transition of the phone lattice carries the same information as the traversed transition in the decoding WFST, plus the acoustic score. During the Viterbi beam search, whenever a transition e is traversed, a corresponding transition e′ is added to the phone lattice. The start state of e′ is S(e′) = (ts, S(e)), where S(e) is the start state of e and ts denotes the token entering time. The destination state is recorded as D(e′) = (te, D(e)). Therefore, when the decoding procedure finishes, a C-level WFST lattice has also been recorded.
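The recording step described above can be sketched as follows (a minimal sketch; all class and field names are hypothetical, assuming each traversed decoder transition is mirrored into a lattice arc annotated with its entry/exit frames and acoustic score):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatticeState:
    frame: int        # time frame t
    wfst_state: int   # state s of the decoding C-level WFST

@dataclass
class LatticeArc:
    src: LatticeState
    dst: LatticeState
    in_label: str     # context-dependent phone label
    out_label: str    # word label or epsilon
    graph_score: float
    acoustic_score: float

class PhoneLattice:
    def __init__(self):
        self.arcs = []

    def record(self, t_start, t_end, wfst_arc, acoustic_score):
        """Mirror a traversed WFST transition e into lattice arc e'."""
        src_state, dst_state, in_lab, out_lab, weight = wfst_arc
        arc = LatticeArc(
            src=LatticeState(t_start, src_state),
            dst=LatticeState(t_end, dst_state),
            in_label=in_lab, out_label=out_lab,
            graph_score=weight, acoustic_score=acoustic_score)
        self.arcs.append(arc)
        return arc
```

Because each lattice state is the pair (t, s), two tokens reaching the same WFST state at different frames produce distinct lattice states, which is what preserves the phone-level time boundaries.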

III. LATTICE GENERATION ALGORITHM

The steps described in [2] after phone lattice recording are not suitable for generating a word lattice with accurate word boundaries. For one thing, the output label of a word is not fixed to a particular phone of its pronunciation; output labels can float among all their possible pronunciation phones, as discussed in the following subsection. Therefore, simply deleting all the input phone labels does not guarantee that each word gets the right time boundary. For another, the standard WFST algorithms, such as ϵ-removal and determinization, treat WFST states as carrying no information, so they discard all the time information stored in the states of the phone lattice. To solve these problems, a new lattice generation algorithm is proposed in this section.

Fig. 1. Example of output label shift of the word "red".

A. Output label position in WFST

The WFST decoding network can be fully optimized as a single huge network. However, this also has the disadvantage that there is no accurate word end. The phone lattice generated during decoding has an accurate time boundary for each phone, but it is hard to decide the word boundaries. This is mainly because the output label is not fixed at a particular phone of the word.

In constructing the lexicon WFST L̃, the word output label is always placed on the first phone of the word. This brings the benefit that, while composing L̃ with G̃, the quick match between the output labels of L̃ and the input labels of G̃ greatly reduces the memory and time cost. According to [6], composition with ϵ transitions goes through a composition filter, which does not change the output label positions. The minimization operation is suggested to be performed in finite-state machine form [9], and it does not change the output labels either. Only the determinization operation changes the output label positions. However, an output label never moves beyond the phones of its own word. For example, the word "red" has three phones: r, eh1 and d. The output label of "red" can float among all these phones, as illustrated in Fig. 1. Sometimes ϵ transitions are inserted between the phones, and the output label can also end up on these inserted ϵ transitions. The proposed algorithm is based on this property.

B. Phone lattice to word lattice converting

A token-passing-based phone-to-word lattice conversion algorithm is proposed in this subsection. A pronunciation dictionary must be used during the conversion; it aligns the input phone labels with the output word labels. Tokens traverse the whole phone lattice by depth-first search (DFS) to complete the conversion. Based on the property mentioned previously, each token only needs to hold the phones of the word it is currently traversing. Other information, such as the acoustic score, the LM score, the output word label and the previous state in the word lattice, is also recorded in the token, as depicted in Fig. 2.

A stack is used to record the phones in the token. Every time the token traverses a transition, the phone label of the transition is pushed onto this stack, e.g., Token 1 in Fig. 2. When the token passes a transition with an output word label, the pronunciations of that word are filled into the token; multiple pronunciations must be considered, e.g., Token 2 in Fig. 2. When the input phones in the stack match one of the


Fig. 2. Proposed procedure of token passing.

Fig. 3. Example of redundant traveling control.

pronunciations, the token cleans up the stack and generates a new transition in the word lattice. Finally, the pointer to the previous state is updated to the newly generated state, as shown for Token 2 and Token 4 in Fig. 2. If the pronunciation is not matched, the token keeps traveling forward along the phone lattice. The generation of the word lattice is complete when all tokens finish traveling and reach the final state.
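The matching step can be sketched one lattice transition at a time (a simplified sketch with a toy lexicon; all names are our own, and the traversal is linearized here, whereas the real algorithm runs this inside a DFS over the phone lattice):

```python
# Toy pronunciation dictionary (hypothetical): word -> list of pronunciations.
LEXICON = {"red": [["r", "eh1", "d"]],
           "read": [["r", "eh1", "d"], ["r", "iy1", "d"]]}

class Token:
    def __init__(self, start_frame):
        self.phones = []            # mono-phone stack
        self.prons = []             # candidate pronunciations of current word
        self.word = None            # output word label, once seen
        self.start_frame = start_frame

def pass_token(token, phone, out_label, frame, word_arcs):
    """Advance a token over one phone-lattice transition."""
    if phone is not None:
        token.phones.append(phone)          # push phone onto the stack
    if out_label is not None:               # word output label encountered
        token.word = out_label
        token.prons = LEXICON.get(out_label, [])
    if token.word and token.phones in token.prons:
        # Pronunciation matched: emit a word arc with real time boundaries,
        # then clean up the stack and restart from the current frame.
        word_arcs.append((token.word, token.start_frame, frame))
        token.phones, token.prons, token.word = [], [], None
        token.start_frame = frame
    return token

# The output label of "red" may sit on any of its three phone transitions;
# here it happens to arrive with the first phone.
arcs = []
t = Token(0)
pass_token(t, "r", "red", 3, arcs)
pass_token(t, "eh1", None, 6, arcs)
pass_token(t, "d", None, 9, arcs)   # match -> word arc ("red", 0, 9)
```

The emitted arc carries the frame at which the token started the word and the frame at which the pronunciation completed, which is exactly the accurate word boundary the conversion is after.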

C. Optimization in redundancy control

Redundant computation can be a great concern if the DFS token passing is not controlled. The calculation complexity is O(nS · nT), where nS and nT denote the numbers of states and transitions in the phone lattice. For example, if a state has m incoming and n outgoing transitions, all n outgoing transitions are traversed each time a token reaches this state. In the end, each outgoing transition is traversed m times, making the algorithm extremely redundant.

To solve this problem, a traveling control method is developed, illustrated in Fig. 3. DFS has the property that all paths outgoing from a state are completely traversed when the state is visited for the first time. The traveling control exploits this property as follows. First, if a token reaches a state and generates a new transition in the word lattice, a pop-flag is marked on the state. Second, a traveling-flag is marked on the state when this token finishes traversing all its outgoing transitions by DFS. Third, if a state carrying both marks is reached by another token that also generates a new transition in the word lattice, the subsequent token passing can be judged redundant and terminated. It is important that the traveling control is applied only at states where the token generates a new transition in the word lattice. For one thing, dangling states could appear in the word lattice if the control were applied at all states. For another, a state where a token can generate a new transition usually has a large number of outgoing transitions, because such a state is a real word end and there are many outgoing words at the language model level. Applying the redundancy control at this kind of state is therefore very efficient. The calculation complexity is reduced to O(nT).
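The effect of the control can be illustrated by counting arc traversals in a DFS over a small converging lattice (a simplified sketch: for brevity the done-flag is applied at every state here, whereas the paper applies the control only at word-end states to avoid dangling word-lattice states):

```python
def traverse(lattice, start, controlled):
    """Count arc traversals in a DFS over `lattice` (dict: state -> successors).
    With controlled=True, a state whose outgoing paths were already fully
    expanded once is never expanded again."""
    done = set()
    count = 0

    def dfs(state):
        nonlocal count
        if controlled and state in done:
            return                      # redundant: prune this token
        for nxt in lattice.get(state, []):
            count += 1
            dfs(nxt)
        done.add(state)                 # outgoing paths fully expanded

    dfs(start)
    return count

# Three paths merge in state 4, which itself has three outgoing arcs:
merge = {0: [1, 2, 3], 1: [4], 2: [4], 3: [4], 4: [5, 6, 7]}
```

Without control, the three outgoing arcs of state 4 are re-walked once per incoming path (15 traversals in total); with control they are walked once (9 traversals), mirroring the reduction from O(nS · nT) to O(nT).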

IV. EXPERIMENT RESULT AND ANALYSIS

All the experiments are carried out on a corpus published by the Chinese governmental research program 863. It contains about 11 hours of speech, comprising 9692 utterances from fourteen speakers (7 female and 7 male). A 43k-word lexicon and a 3-gram language model built with the SRILM tools [10], containing 33 million 3-grams and 29 million 2-grams, are used in decoding. Two acoustic models are used in the following experiments: a fast acoustic model with 5884 shared HMM states and 12-component Gaussian mixture models (GMMs), and an elaborate acoustic model with 20325 shared HMM states and 40-component GMMs.

A. Redundancy control

TABLE I
RESULT OF THE REDUNDANCY CONTROL.

              Average # of iterations   Real time (xRT)
without Opt   1262759.2                 20.0
with Opt      1181.2                    0.062

All the utterances are decoded twice in this test with the fast acoustic model: first by the lattice algorithm without the redundancy control optimization, and then with the optimization for comparison. The average number of iterations of the traveling function and the average decoding real time are both reported in TABLE I. The number of iterations is cut by a factor of more than 1000 with


the redundancy control technique. The result also confirms that the calculation complexity of the algorithm is reduced from O(nS · nT) to O(nT). With this technique, the efficiency of the proposed lattice algorithm is significantly improved.

B. Lattice generation efficiency

TABLE II
LATTICE GENERATION EFFICIENCY TEST ON DIFFERENT ACOUSTIC MODELS.

AM       Recording   Converting   Total
12-GMM   12.28%      17.43%       29.71%
40-GMM   7.42%       3.51%        10.93%

This experiment is carried out on both acoustic models. The elaborate model targets a slow system with lower speed but higher decoding accuracy; the fast model is the opposite. Both the phone lattice recording time and the phone-to-word lattice conversion time are reported as the ratio of additional computation time over one-pass decoding, as shown in TABLE II. The total overhead with the elaborate acoustic model is smaller than with the fast model: the elaborate model needs more acoustic computation, so the lattice time is relatively smaller. The phone lattice recording time matches the roughly 10% reported in [2]. However, the phone-to-word conversion time differs greatly between the two cases. This is because the pruning strategy works more efficiently with the elaborate acoustic model, and the recorded paths are more discriminative and less confusable than with the fast model; therefore, the conversion process costs less time.

C. Lattice quality

To focus on the lattice quality and the LM rescoring, the fast acoustic model is used in this experiment so as to emphasize the language model rather than the acoustic model. We compare our lattice quality with that of the lattice generated by a WCTS decoder described in [11]. The lattice character error rate (LCER), i.e., the lowest character error rate achievable within the lattice, is used to measure the lattice quality. A 5-gram LM with 447 million n-grams is also used to rescore the lattice with lattice-tool [10]. The results are illustrated in Fig. 4. The 1-best decoding result of the proposed decoder is slightly better than that of the WCTS decoder. The LM rescoring results further show that the proposed lattice is of higher quality: rescoring decreases the CER of the proposed lattice by 1.9, compared with 1.3 for the WCTS lattice, and the oracle lattice character error rate of the proposed lattice is 31.3 versus 33.5 for WCTS. The proposed lattice thus has better quality than the lattice of the comparative WCTS decoder.

V. CONCLUSION

This paper has reported an investigation of word lattice generation in the WFST decoding framework. We proposed a word lattice generation algorithm based on the existing phone lattice. The proposed algorithm preserves accurate word

Fig. 4. Comparison of the proposed lattice to the WCTS lattice: character error rate (%) versus real time, for the 1-best, rescored and LCER results of both decoders.

boundaries and is compatible with the existing downstream lattice processing tools. The experimental results show that the proposed lattice has better quality than the WCTS lattice. Future work will focus on whether this algorithm can be applied in the on-the-fly case.

ACKNOWLEDGEMENTS

This work is partially supported by the National Natural Science Foundation of China (Nos. 10925419, 90920302, 61072124, 11074275, 11161140319) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA06030100, XDA06030500).

REFERENCES

[1] S. Kanthak, H. Ney, M. Riley, and M. Mohri, "A comparison of two LVR search optimization techniques," in International Conference on Spoken Language Processing, 2002.

[2] A. Ljolje, F. Pereira, and M. Riley, "Efficient general lattice generation and rescoring," 1999, pp. 1251–1254.

[3] D. Rybach, R. Schlüter, and H. Ney, "A comparative analysis of dynamic network decoding," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5184–5187.

[4] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiat, S. Kombrink, P. Motlicek, and Y. Qian, "Generating exact lattices in the WFST framework," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012, pp. 4213–4216.

[5] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech and Language, vol. 16, no. 1, pp. 69–88, 2002.

[6] M. Mohri, F. Pereira, and M. Riley, Speech Recognition with Weighted Finite-State Transducers, chapter 28, pp. 559–583, Springer, 2008.

[7] M. Riley, F. Pereira, and M. Mohri, "Transducer composition for context-dependent network expansion," in 5th European Conference on Speech Communication and Technology, 1997, vol. 3, pp. 1427–1430.

[8] M. Mohri and M. Riley, "Weighted determinization and minimization for large vocabulary speech recognition," in 5th European Conference on Speech Communication and Technology, 1997, vol. 1, pp. 131–134.

[9] C. Allauzen, M. Mohri, M. Riley, and B. Roark, "A generalized construction of integrated speech recognition transducers," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004, vol. 1, pp. 761–764.

[10] A. Stolcke, "SRILM - an extensible language modeling toolkit," in International Conference on Spoken Language Processing, 2002, vol. 2, pp. 901–904.

[11] J. Shao, T. Li, Q. Zhang, Q. Zhao, and Y. Yan, "A one-pass real-time decoder using memory-efficient state network," IEICE Transactions on Information and Systems, vol. 91, no. 3, pp. 529–537, 2008.