MultitaskLearningwithLocalAttentionforTibetan …...2020/09/22 · diﬀerentpositionswithinWaveNet,i.e.,intheinputlayer andhighlayer,respectively. econtributionofthisworkisthree-fold.Forone,we

Research ArticleMultitask Learning with Local Attention for TibetanSpeech Recognition

Hui Wang Fei Gao Yue Zhao Li Yang Jianjian Yue and Huilin Ma

School of Information Engineering Minzu University of China Beijing 100081 China

Correspondence should be addressed to Fei Gao 18301390muceducn

Received 22 September 2020 Revised 12 November 2020 Accepted 26 November 2020 Published 18 December 2020

Academic Editor Ning Cai

Copyright copy 2020 Hui Wang et al is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

In this paper we propose to incorporate the local attention in WaveNet-CTC to improve the performance of Tibetan speechrecognition in multitask learning With an increase in task number such as simultaneous Tibetan speech content recognitiondialect identification and speaker recognition the accuracy rate of a single WaveNet-CTC decreases on speech recognitionInspired by the attention mechanism we introduce the local attention to automatically tune the weights of feature frames in awindow and pay different attention on context information for multitask learninge experimental results show that our methodimproves the accuracies of speech recognition for all Tibetan dialects in three-task learning compared with the baseline modelFurthermore our method significantly improves the accuracy for low-resource dialect by 511 against the specific-dialect model

1 Introduction

Multitask learning has been applied successfully for speechrecognition to improve the generalization performance ofthe model on the original task by sharing the informationbetween related tasks [1ndash9] Chen and Mak [6] used themultitask framework to conduct joint training of multiplelow-resource languages exploring the universal phonemeset as a secondary task to improve the effect of the phonememodel of each language Krishna et al [7] proposed a hi-erarchical multitask model and the performance differencesbetween high-resource language and low-resource languagewere compared Li et al [8] and Toshniwal et al [9] in-troduced additional information of language ID to improvethe performance of end-to-end multidialect speech recog-nition systems

Tibetan is one of minority languages in China It hasthree major dialects in China ie U-Tsang Kham andAmdoere are also several local subdialects in each dialectTibetan dialects pronounce very differently but the writtencharacters are unified across dialects In our previous work[10] Tibetan multidialect multitask speech recognition wasconducted based on the WaveNet-CTC which performedsimultaneous Tibetan multidialect speech content

recognition dialect identification and speaker recognitionin a single model WaveNet is a deep generative model withvery large receptive fields and it can model the long-termdependency of speech data It is very effective to learn theshared representation from speech data of different tasksus WaveNet-CTC was trained on three Tibetan dialectdata sets and learned the shared representations and modelparameters for speech recognition speaker identificationand dialect recognition Since the Lhasa of U-Tsang dialect isa standard Tibetan speech there are more corpora availablefor training than Changdu-Kham and Amdo pastoral dia-lect Although two-task WaveNet-CTC improved the per-formance on speech recognition for Lhasa of U-Tsang dialectand Changdu-Kham dialect the three-task model did notimprove performance for all dialects With an increase intask number the speech recognition performance degraded

To obtain a better performance attention mechanism isintroduced intoWaveNet-CTC for multitask learning in thispaper Attention mechanism can learn to set larger weight tomore relevant frames at each time step Considering thecomputation complexity we conduct a local attention usinga sliding window on the whole of speech feature frames tocreate the weighted context vectors for different recognitiontasks Moreover we explore to place a local attention at the

HindawiComplexityVolume 2020 Article ID 8894566 10 pageshttpsdoiorg10115520208894566

different positions within WaveNet ie in the input layerand high layer respectively

e contribution of this work is three-fold For one wepropose the WaveNet-CTC with local attention to performmultitask learning for Tibetan speech recognition which canautomatically capture the context information among dif-ferent tasks is model improves the performance of theTibetan multidialect speech recognition task Moreover wecompared the performance of local attention inserted atdifferent positions in the multitask model e attentioncomponent embedded in the high layer of WaveNet obtainsbetter performance than the one in the input layer ofWaveNet for speech recognition Finally we conduct asliding window on the speech frames for efficiently com-puting the local attention

e rest of this paper is organized as follows Section 2introduces the related work Section 3 presents our methodand gives the description of the baseline model local at-tention mechanism and the WaveNet-CTC with local at-tention In Section 4 the Tibetan multidialect data set andexperiments are explained in detail Section 5 describes ourconclusions

2 Related Work

Connectionist temporal classification (CTC) for end-to-endhas its advantage of training simplicity and is one of themostpopular methods used in speech recognition Das et al [11]directly incorporated attention modelling within the CTCframework to address high word error rates (WERs) for acharacter-based end-to-end model But in Tibetan speechrecognition scenarios the Tibetan character is a two-di-mensional planar character which is written in Tibetanletters from left to right besides there is a vertical super-position in syllables so a word-based CTC is more suitablefor the end-to-end model In our work we try to introduceattention mechanism in WaveNet as an encoder for theCTC-based end-to-end model e attention is used inWaveNet to capture the context information among dif-ferent tasks for distinguishing dialect content dialectidentity and speakers

In multitask settings there are some recent works fo-cusing on incorporating attention mechanism in multitasktraining Zhang et al [12] proposed an attention mechanismfor the hybrid acoustic modelling framework based onLSTM which weighted different speech frames in the inputlayer and automatically tuned its attention to the splicedcontext input e experimental results showed that at-tention mechanism improved the ability to model speechLiu et al [13] incorporated the attention mechanism inmultitask learning for computer vision tasks in which themultitask attention network consisted of a shared networkand task-specific soft-attention modules to learn the task-specific features from the global pool whilst simultaneouslyallowing for features to be shared across different tasksZhang et al [14] proposed an attention layer on the top ofthe layers for each task in the end-to-end multitaskframework to relieve the overfitting problem in speechemotion recognition Different from the works of Liu et al

and Zhang et al [13 14] which distributed many attentionmodules in the network our method merely uses one slidingattention window in the multitask network and has itsadvantage of training simplicity

3 Methods

31 Baseline Model We take the Tibetan multitask learningmodel in our previous work [10] as the baseline model asshown in Figure 1 which was initially proposed for Chineseand Korean speech recognition from the work of Xu [15] andKim and Park [16] e work [10] integrates WaveNet [17]with CTC loss [18] to realize Tibetan multidialect end-to-end speech recognition

WaveNet contains the stacks of dilated causal con-volutional layers as shown in Figure 2 In the baseline modelthe WaveNet network consists of 15 layers which aregrouped into 3 dilated residual blocks of 5 layers In everystack the dilation rate increases by a factor of 2 in everylayer e filter length of causal dilated convolutions is 2According to equations (1) and (2) the respective field ofWaveNet is 46

Receptive fieldblock 1113944

n

i1Filterlength minus 11113872 1113873 times Dilationratei

+ 1

(1)

Receptive fieldstacks S times Receptive fieldblock minus S + 1 (2)

In equations (1) and (2) S refers to the number of stacksReceptive fieldblock refers to the receptive field of a stack ofdilated CNN Receptive fieldstacks refers to the receptive fieldof some stacks of dilated CNN and Dilationratei

refers to thedilation rate of the i-th layer in a block

WaveNet also uses residual and parameterized skipconnections [19] to speed up convergence and enabletraining of much deeper models More details aboutWaveNet can be found in [17]

Connectionist temporal classification (CTC) is an al-gorithm that trains a deep neural network [20] for the end-to-end learning task It can make the sequence label pre-dictions at any point in the input sequence [18] In thebaseline model since the Tibetan character is a two-di-mensional planar character as shown in Figure 3 the CTCmodeling unit for Tibetan speech recognition is Tibetansingle syllable otherwise a Tibetan letter sequence from leftto right is unreadable

32 Local Attention Mechanism Since the effect of eachspeech feature frame is different for the target label output atcurrent time considering the computational complexity weintroduce the local attention [21] into WaveNet to create aweighted context vector for each time i e local attentionplaces a sliding window with the length 2n centered aroundthe current speech feature frame on the input layer andbefore the softmax layer in WaveNet respectively and re-peatedly produces a context vector Ci for the current input(or hidden) feature frame x(h)ie formula for Ci is shown

2 Complexity

in equation (3) and the schematic diagram is shown inFigure 4

Ci 1113944i+n

jiminusnjneiαij middot x(h)j (3)

where αij is the attention weight subject to αge 0 and1113936jαij 1 through softmax normalization e αij calcu-lation method is as follows

αij exp Score x(h)i x(h)j1113872 11138731113872 1113873

1113936jexp Score x(h)i x(h)j1113872 11138731113872 1113873 (4)

It captures the correlation of speech frame pair(x(h)i x(h)j jne i) e attention operates on n framesbefore and after the current frame Score () is an energyfunction whose value is computed as equation (5) by theMLP which is jointly trained with all the other componentsin an end-to-end network ose x(h)j jne i that get largerscores would have more weights in context vector Ci

Score xi xj1113872 1113873 vTa tanh Wa x(h)i x(h)j1113960 11139611113872 1113873 (5)

Finally x(h)i is concatenated with Ci as the extendedfeature frame and fed into the next layer of WaveNet asshown in Figures 5 and 6e attentionmodule is inserted inthe input layer in Figure 5 referred as Attention-WaveNet-CTC e attention module is embedded before the softmaxlayer in Figure 6 referred as WaveNet-Attention-CTC

4 Experiments

41 Data Our experimental data are from an open and freeTibetan multidialect speech data set TIBMDMUC [10] inwhich the text corpus consists of two parts one is 1396spoken language sentences selected from the book ldquoTibetanSpoken Languagerdquo [22] written by La Bazelen and the otherpart contains 8000 sentences from online news electronicnovels and poetry of Tibetan on internet All text corpora inTIBMDMUC include a total of 3497 Tibetan syllables

ere are 40 recorders who are from Lhasa City in TibetYushu City in Qinghai Province Changdu City in Tibet andTibetan Qiang Autonomous Prefecture of Ngawaey useddifferent dialects to speak out the same text for 1396 spokensentences and other 8000 sentences are read loudly in Lhasadialect Speech data files are converted to 16K Hz samplingfrequency 16 bit quantization accuracy and wav format

Our experimental data for multitask speech recognitionare shown in Table 1 which consists of 44 hours Lhasa-U-Tsang 190 hours Changdu-Kham and 328 hours Amdopastoral dialect and their corresponding texts contain 1205syllables for training We collect 049 hours Lhasa-U-Tsang019 hours Changdu-Kham and 037 hours Amdo pastoraldialect respectively to test

39 MFCC features of each observation frame areextracted from speech data using a 128ms window with96ms overlaps

e experiments are divided into two parts two-taskexperiments and three-task experiments ree dialect-specific models and a multi-dialect model without attentionare trained on WaveNet-CTC

In WaveNet the number of hidden units in the gatinglayers is 128 e learning rate is 2times10minus4 e number ofhidden units in the residual connection is 128

42 Two-task Experiment For two-task joint recognition theperformances of the dialect ID or speaker ID at the beginningand at the end of output sequence were evaluated respectivelyWe set n 5 frames before and after the current frame tocalculate the attention coefficients for attention-based Wave-Net-CTC which are referred to as Attention (5)-WaveNet-CTCand WaveNet-Attention (5)-CTC respectively for the twoarchitectures in Figures 5 and 6 Compared with the calculationof the attention coefficient of all frames the calculation speed oflocal attention has been improved quickly which is convenientfor the training of models

e speech recognition result is summarized in Table 2e best model is the proposed WaveNet-Attention-CTCwith the attention embedded before the softmax layer inWaveNet and dialect ID at the beginning of label sequenceIt outperforms the dialect-specific model by 739 and 24respectively for Lhasa-U-Tsang and Changdu-Kham andgets the SER close to the dialect-specific model for AmdoPastoral which has the highest ARSER (average relativesyllable error rate) for three dialects emodel of dialectID-speech (D-S) in the framework of WaveNet-Attention-CTCis effective to improve multilinguistic speech content rec-ognition Speech content recognition is more sensitive to the

Speech wave

MFCC

WaveNet

CTC loss

[Speaker ID] [Dialect ID]

[Speaker ID] [Dialect ID]or

Figure 1 e baseline model

Complexity 3

xT~xTndash15

Stack 3

Stack 2

Stack 1

Hidden layerdilation = 4



Input

Outputdilation = 8




Input

Outputdilation = 8




Input

Outputdilation = 8

xTndash15 xTndash14 xTndash13 xTndash12 xTndash11 xTndash10 xTndash9 xTndash8 xTndash7 xTndash6 xTndash5 xTndash4 xTndash3 xTndash2 xTndash1 xT

xT~xTndash30

xT~xTndash45

Figure 2 3 stacks of 5 dilated causal convolutional layers with filter length 2

Vowel (u)

Vowel (i e o)

Post-

posts

crip

t

Posts

crip

t

Pres

crip

t

Root

Superscript

Subscript

Figure 3 e structure of a Tibetan syllable

4 Complexity

recognition of dialect ID than speaker IDe recognition ofdialect ID helps to identify the speech content However theattention inserted before the input layer inWaveNet resultedin the worst recognition which shows that raw speechfeature cannot provide much information to distinguish themultitask

For dialect ID recognition in Table 3 we can see thatthe model with attention mechanism added before thesoftmax layer performs better than which is added in theinput layer and the dialect ID at the beginning is betterthan that at the end From Table 2 and Table 3 it can beseen that the dialect ID recognition influences the speechcontent recognition

We also test the speaker ID recognition accuracy for thetwo-task models Results are listed in Table 4 It is worthnoting that the Attention-WaveNet-CTC model performspoorly on both tasks of the speaker and speech contentrecognition Especially in the speaker identification task therecognition rate of the speakerID-speech model in all threedialects is very poor Among the Attention-WaveNet-CTCmodels it can be seen that the modelling ability of twomodels of the dialectID-speech and speakerID-speechmodelshows big gap which means the Attention-WaveNet-CTCarchitecture cannot learn effectively the correlation amongmultiple frames of acoustic feature for multiple classificationtasks In contrast the WaveNet-Attention-CTC model has a

+ 1 times 1 1 times 1 Softmax

Sentence label (character level)

CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

Xjndashn Xj+n

Attentionmechanism

Concatenation

Xj

tanh tanh

Figure 5 e architecture of attention-WaveNet-CTC

αiindashn

xindashn

Ci

lowast

αiindash1

xindash1

lowast

αii+1

xi+1

lowast

αii+n

xi+n

lowast

Figure 4 Local attention

Complexity 5

much better performance on the two tasks e attentionembedded before the softmax layer can find the related andimportant frames to lead to high recognition accuracy

43ree-task Experiment We compared the performancesof two architectures namely Attention-WaveNet-CTC andWaveNet-Attention-CTC on three-task learning with thedialect-specific model and WaveNet-CTC where we eval-uated n 5 n 7 and n 10 respectively for the attentionmechanism e results are shown in Table 5

We can see that the three-task models have worseperformance compared with the two-task model andWaveNet-Attention-CTC has lower SERs for Lhasa-U-Tsang and Amdo Pastoral against the dialect-specificmodel but for Changdu-Kham a relative low-resourceTibetan dialect the model of dialectID-speech-speakerID(D-S-S2) based on the framework of WaveNet-Attention

(10)-CTC achieved the highest recognition rate in all modelswhich outperforms the dialect-specific model by 511 Weanalyzed the reason that maybe is the reduction of gener-alization error of the multitask model with the number oflearning tasks increasing It improves the recognition ratefor small-data dialect however not for big-data dialectsSince ASER reflects the generalization error of the modelD-S-S2 of WaveNet-Attention (10)-CTC has highest ASERin all models which shows it has better generalization ca-pacity Meanwhile WaveNet-Attention (10)-CTC achievedthe better performance than WaveNet-Attention (5)-CTCand WaveNet-Attention (7)-CTC for speech content rec-ognition as shown in Figure 7 where the syllable error ratesdeclined with the number of n increasing for three dialectsand Changdu-Khamrsquos SER has a quickest descent We canconclude that attention mechanism needs a longer range todistinguish more tasks and it pays more attention on the

+ 1 times 1 1 times 1

Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh

Figure 6 e architecture of WaveNet-attention-CTC

Table 1 e experimental data statistics

Dialect Training data (hours) Training utterances Test data (hours) Test utterances SpeakerLhasa-U-Tsang 440 6678 049 742 20Changdu-Kham 190 3004 019 336 6Amdo pastoral 328 4649 037 516 14Total 958 14331 105 2110 40

6 Complexity

low-resource task It is also observed that WaveNet-At-tention (5)-CTC has better performance than Attention (5)-WaveNet-CTC which demonstrates again that the attentionmechanism placed in the high layer can find the related andimportant information which leads to more accurate speechrecognition than when it is put in the input layer

From Tables 6 and 7 we can observe that models withattention have worse performance than the ones withoutattention for dialect ID recognition and speaker ID rec-ognition and longer attention achieved the worse rec-ognition for the language with large data It also showsthat in the case of more tasks the attention mechanism

tends towards the low-resource task such as speechcontent recognition

In summary combining the results of the above experi-ments whether two task or three task the multitask model canmake a significant improvement on the performance of thelow-resource task by incorporating the attention mechanismespecially when the attention is applied to the high-level ab-stract features e attention-based multitask model canachieve the improvements on speech recognition for all dialectscompared with the baselinemodelWith an increase in the tasknumber the multitask model needs to increase the range forattention to distinguish multiple dialects

Table 4 Speaker ID recognition accuracy () of two-task models

Architecture Model Lhasa-U-Tsang Changdu-Kham Amdo PastoralSpeakerID model 6775 9313 9531

WaveNet-CTC with speaker ID S-S1 6832 9285 9748S-S2 7115 9523 9612

Attention (5)-WaveNet-CTC S-S1 0 0 0S-S2 6064 7738 8585

WaveNet-Attention (5)-CTC S-S1 7035 9285 9748S-S2 6940 100 9670

Table 2 Syllable error rate () of two-task models on speech content recognition

Architecture ModelLhasa-U-Tsang

Changdu-Kham Amdo Pastoral

SER1 RSER2 SER RSER SER RSER ASER3

Dialect-specific model 2883 6256 176WaveNet-CTC 2955 minus072 6283 minus027 3352 minus1592 minus563

WaveNet-CTC with dialect ID or speaker ID (baseline model)

D-S4 3284 minus401 6858 minus602 3300 minus1540 minus848S-D5 2680 203 6403 minus147 3079 minus1309 minus421S-S16 2721 162 6417 minus161 2968 minus1208 minus402S-S27 2813 07 6243 013 2804 -1044 minus320

Attention (5)-WaveNet-CTC

D-S 5219 minus2336 6524 minus268 5022 -3262 minus1955S-D 5516 minus2633 6778 minus522 5523 -3763 minus2306S-S1 7742 minus4859 8544 minus2288 8208 -6448 minus4532S-S2 8332 minus5449 8915 minus2694 8147 -6387 minus4843

WaveNet-Attention (5)-CTC

D-S 2144 739 6016 240 2046 minus286 231S-D 2379 504 6296 minus04 2415 minus655 minus064S-S1 3486 minus603 6336 minus08 4010 minus2250 minus978S-S2 3483 minus600 6270 minus014 3763 minus2003 minus872

1SER syllable error rate 2RSER relative syllable error rate 3ARSER average relative syllable error rate 4D-S the model trained using the transcription withdialect ID at the beginning of target label sequence like ldquoA ཐགས ར ཆrdquo 5S-D the model trained using the transcription with dialect ID at the end of targetlabel sequence 6S-S1 the model trained using the transcription with speaker ID at the beginning of target label sequence and 7S-S2 the model trained usingthe transcription with speaker ID at the end of target label sequence

Table 3 Dialect ID recognition accuracy () of two-task models

Architecture Model Lhasa-U-Tsang Changdu-Kham Amdo PastoralDialectID model 9788 9224 979

WaveNet-CTC with dialect ID D-S 9857 9523 996S-D 9901 9761 9941

Attention (5)-WaveNet-CTC D-S 100 8928 9452S-D 0 0 0

WaveNet-Attention (5)-CTC D-S 100 988 9941S-D 100 9404 9806

Complexity 7

Table 5 Syllable error rate () of three-task models on speech content recognition



SER RSER SER RSER SER RSER ASERDialect-specific model 2883 6256 1760

WaveNet-CTC with dialect ID and speaker ID (baseline model)S-D-S 3064 minus181 6417 minus161 3406 minus1646 minus662D-S-S1 3964 minus1081 6510 minus254 4515 minus2755 minus1363D-S-S2 3343 minus460 6483 minus227 3756 minus1996 minus894

Attention (5)-WaveNet-CTCS-D-S 4869 minus1986 6831 minus575 6322 minus4562 minus2374D-S-S1 5257 minus2374 6938 minus682 7142 minus5382 minus2813D-S-S2 4910 minus2027 7941 minus1685 6109 minus4349 minus2687

WaveNet-Attention (5)-CTCS-D-S 3075 minus192 6951 minus695 3421 minus1661 minus849D-S-S1 3317 minus434 6951 -695 3849 minus2089 minus1073D-S-S2 3116 minus233 6925 minus669 3414 minus1654 minus852

WaveNet-Attention (7)-CTCS-D-S 3039 minus156 7005 minus749 327 minus151 minus805D-S-S1 3528 minus645 6812 minus556 3803 minus2073 minus1081D-S-S2 3258 minus375 6274 minus018 3716 minus1956 minus783

WaveNet-Attention (10)-CTCS-D-S 3025 minus142 6925 minus669 3201 minus1441 minus751D-S-S1 3406 minus523 7005 minus749 4010 minus2250 minus1174D-S-S2 3185 minus302 5745 511 3365 minus1605 minus465

Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35

Figure 7 Syllable error rate of WaveNet-Attention-CTC for different lengths of the attention window

8 Complexity

5 Conclusions

is paper proposes a multitask learning mechanism withlocal attention based on WaveNet to improve the perfor-mance for low-resource language We integrate Tibetanmultidialect speech recognition speaker ID recognition anddialect identification into a unified neural network andcompare the attention effects on the different places in ar-chitectures e experimental results show that our methodis effective for Tibetan multitask processing scenarios eWaveNet-CTC model with attention added into the highlayer obtains the best performance for unbalance-resourcemultitask processing In the future works we will evaluatethe proposed method on larger Tibetan data set or ondifferent languages

Data Availability

e data used to support the findings of this study are availablefrom the corresponding author (1009540871qqcom) uponrequest

Conflicts of Interest

e authors declare that they have no conflicts of interest

Authorsrsquo Contributions

Hui Wang and Yue Zhao contributed equally to thiswork

Table 7 Speaker ID recognition accuracy () of three-task models

Architecture Model Lhasa-U-Tsang Changdu-Kham Amdo pastoralSpeakerID model 6775 9313 9531

WaveNet-CTC with dialect ID and speaker IDS-D-S 7291 988 9612D-S-S1 7021 9523 936D-S-S2 7035 9642 9689

Attention (5)-WaveNet-CTCS-D-S 6108 8333 8953D-S-S1 6212 8333 8701D-S-S2 6199 8452 9011

WaveNet-Attention (5)-CTCS-D-S 6199 8571 9205D-S-S1 6253 8214 9108D-S-S2 6118 8928 9244



Table 6 Dialect ID recognition accuracy () of three-task models


WaveNet-CTC with dialect ID and speaker IDD-S-S1 9801 988 9941D-S-S2 9973 9642 9961S-D-S 9925 9523 9903





Complexity 9

Acknowledgments

is work was supported by the National Natural ScienceFoundation under grant no 61976236

References

[1] Z Tang L Li and D Wang ldquoMulti-task recurrent model forspeech and speaker recognitionrdquo in Proceedings of the 2016Asia-Pacific signal and Information Processing AssociationAnnual Summit and Conference (APSIPA) pp 1ndash4 JejuSouth Korea December 2016

[2] O Siohan and D Rybach ldquoMultitask learning and systemcombination for automatic speech recognitionrdquo in Proceed-ings of the 2015 IEEE Workshop on Automatic Speech Rec-ognition and Understanding (ASRU) pp 589ndash595 ScottsdaleAZ USA December 2015

[3] Y Qian M Yin Y You and K Yu ldquoMulti-task joint-learningof deep neural networks for robust speech recognitionrdquo inProceedings of the 2015 IEEE Workshop on Automatic SpeechRecognition and Understanding (ASRU) pp 310ndash316Scottsdale AZ USA December 2015

[4] Aanda and S M Venkatesan ldquoMulti-task learning of deepneural networks for audio visual automatic speech recogni-tionrdquo 2020 httparxivorgabs170102477

[5] X Yang K Audhkhasi A Rosenberg S omasB Ramabhadran andM Hasegawa-Johnson ldquoJoint modelingof accents and acoustics for multi-accent speech recognitionrdquoin Proceedings of the 2018 IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP) CalgaryCanada April 2018

[6] D Chen and B K-WMak ldquoMultitask learning of deep neuralnetworks for low-resource speech recognitionrdquo IEEEACMTransactions on Audio Speech and Language Processingvol 23 no 7 pp 1172ndash1183 2015

[7] K Krishna S Toshniwal and K Livescu ldquoHierarchicalmultitask learning for ctc-based speech recognitionrdquo 2020httparxivorgabs180706234

[8] B Li T N Sainath Z Chen et al ldquoMulti-dialect speechrecognition with a single sequence-to-sequence modelrdquo 2017httparxivorgabs171201541

[9] S Toshniwal T N Sainath B Li et al ldquoMultilingual speechrecognition with a single end-to-end modelrdquo 2018 httparxivorgabs171101694

[10] Y Zhao J Yue X Xu L Wu and X Li ldquoEnd-to-end-basedTibetan multitask speech recognitionrdquo IEEE Access vol 7pp 162519ndash162529 2019

[11] A Das J Li R Zhao and Y F Gong ldquoAdvancing con-nectionist temporal classification with attentionrdquo in Pro-ceedings of the 2018 IEEE International Conference onAcoustics Speech and Signal Processing (ICASSP) CalgaryCanada April 2018

[12] Y Zhang P Y Zhang and H Y Yan ldquoLong short-termmemory with attention and multitask learning for distantspeech recognitionrdquo Journal of Tsinghua University (Scienceand Technology) vol 58 no 3 p 249 2018 in Chinese

[13] S Liu E Johns and A J Davison ldquoEnd-to-end multi-tasklearning with attentionrdquo in Proceedings of the IEEE ComputerVision and Pattern Recognition (CVPR) Long Beach CAUSA June2019

[14] Z Zhang B Wu and B Schuller ldquoAttention-augmented end-to-end multi-task learning for emotion prediction fromspeechrdquo in Proceedings of the IEEE International Conference

on Acoustics Speech and Signal Processing (ICASSP) Brigh-ton UK May 2019

[15] S Xu ldquoSpeech-to-text-wavenet end-to-end sentence levelChinese speech recognition using deepmindrsquos wavenetrdquo 2020httpsgithubcomCynthiaSuwiWavenet-demo

[16] Kim and Park ldquoSpeech-to-text-WaveNetrdquo 2016 httpsgithubcomburiburisuri GitHub repository

[17] A van den Oord A Graves H Zen et al ldquoWaveNet agenerative model for raw audiordquo 2016 httparxivorgabs160903499

[18] A Graves Supervised Sequence Labelling with RecurrentNeural Networks Springer New York NY USA 2012

[19] M Wei ldquoA novel face recognition in uncontrolled envi-ronment based on block 2D-CS-LBP features and deep re-sidual networkrdquo International Journal of IntelligentComputing and Cybernetics vol 13 no 2 pp 207ndash221 2020

[20] A S Jadhav P B Patil and S Biradar ldquoComputer-aideddiabetic retinopathy diagnostic model using optimalthresholding merged with neural networkrdquo InternationalJournal of Intelligent Computing amp Cybernetics vol 13 no 3pp 283ndash310 2020

[21] M-T Luong H Pham and C D Manning ldquoEffective ap-proaches to attention-based neural machine translationrdquo2020 httparxivorgabs150804025

[22] B La ldquoTibetan spoken languagerdquo in Chinese 2005

10 Complexity

different positions within WaveNet ie in the input layerand high layer respectively

e contribution of this work is three-fold For one wepropose the WaveNet-CTC with local attention to performmultitask learning for Tibetan speech recognition which canautomatically capture the context information among dif-ferent tasks is model improves the performance of theTibetan multidialect speech recognition task Moreover wecompared the performance of local attention inserted atdifferent positions in the multitask model e attentioncomponent embedded in the high layer of WaveNet obtainsbetter performance than the one in the input layer ofWaveNet for speech recognition Finally we conduct asliding window on the speech frames for efficiently com-puting the local attention

e rest of this paper is organized as follows Section 2introduces the related work Section 3 presents our methodand gives the description of the baseline model local at-tention mechanism and the WaveNet-CTC with local at-tention In Section 4 the Tibetan multidialect data set andexperiments are explained in detail Section 5 describes ourconclusions

2 Related Work

Connectionist temporal classification (CTC) for end-to-endhas its advantage of training simplicity and is one of themostpopular methods used in speech recognition Das et al [11]directly incorporated attention modelling within the CTCframework to address high word error rates (WERs) for acharacter-based end-to-end model But in Tibetan speechrecognition scenarios the Tibetan character is a two-di-mensional planar character which is written in Tibetanletters from left to right besides there is a vertical super-position in syllables so a word-based CTC is more suitablefor the end-to-end model In our work we try to introduceattention mechanism in WaveNet as an encoder for theCTC-based end-to-end model e attention is used inWaveNet to capture the context information among dif-ferent tasks for distinguishing dialect content dialectidentity and speakers

In multitask settings there are some recent works fo-cusing on incorporating attention mechanism in multitasktraining Zhang et al [12] proposed an attention mechanismfor the hybrid acoustic modelling framework based onLSTM which weighted different speech frames in the inputlayer and automatically tuned its attention to the splicedcontext input e experimental results showed that at-tention mechanism improved the ability to model speechLiu et al [13] incorporated the attention mechanism inmultitask learning for computer vision tasks in which themultitask attention network consisted of a shared networkand task-specific soft-attention modules to learn the task-specific features from the global pool whilst simultaneouslyallowing for features to be shared across different tasksZhang et al [14] proposed an attention layer on the top ofthe layers for each task in the end-to-end multitaskframework to relieve the overfitting problem in speechemotion recognition Different from the works of Liu et al

and Zhang et al [13 14] which distributed many attentionmodules in the network our method merely uses one slidingattention window in the multitask network and has itsadvantage of training simplicity

3 Methods

31 Baseline Model We take the Tibetan multitask learningmodel in our previous work [10] as the baseline model asshown in Figure 1 which was initially proposed for Chineseand Korean speech recognition from the work of Xu [15] andKim and Park [16] e work [10] integrates WaveNet [17]with CTC loss [18] to realize Tibetan multidialect end-to-end speech recognition

WaveNet contains the stacks of dilated causal con-volutional layers as shown in Figure 2 In the baseline modelthe WaveNet network consists of 15 layers which aregrouped into 3 dilated residual blocks of 5 layers In everystack the dilation rate increases by a factor of 2 in everylayer e filter length of causal dilated convolutions is 2According to equations (1) and (2) the respective field ofWaveNet is 46

Receptive fieldblock 1113944

n

i1Filterlength minus 11113872 1113873 times Dilationratei

+ 1

(1)

Receptive fieldstacks S times Receptive fieldblock minus S + 1 (2)

In equations (1) and (2) S refers to the number of stacksReceptive fieldblock refers to the receptive field of a stack ofdilated CNN Receptive fieldstacks refers to the receptive fieldof some stacks of dilated CNN and Dilationratei

refers to thedilation rate of the i-th layer in a block

WaveNet also uses residual and parameterized skipconnections [19] to speed up convergence and enabletraining of much deeper models More details aboutWaveNet can be found in [17]

Connectionist temporal classification (CTC) is an al-gorithm that trains a deep neural network [20] for the end-to-end learning task It can make the sequence label pre-dictions at any point in the input sequence [18] In thebaseline model since the Tibetan character is a two-di-mensional planar character as shown in Figure 3 the CTCmodeling unit for Tibetan speech recognition is Tibetansingle syllable otherwise a Tibetan letter sequence from leftto right is unreadable

32 Local Attention Mechanism Since the effect of eachspeech feature frame is different for the target label output atcurrent time considering the computational complexity weintroduce the local attention [21] into WaveNet to create aweighted context vector for each time i e local attentionplaces a sliding window with the length 2n centered aroundthe current speech feature frame on the input layer andbefore the softmax layer in WaveNet respectively and re-peatedly produces a context vector Ci for the current input(or hidden) feature frame x(h)ie formula for Ci is shown

2 Complexity


Ci 1113944i+n








4 Experiments









Speech wave

MFCC

WaveNet

CTC loss




Complexity 3

xT~xTndash15

Stack 3

Stack 2

Stack 1




Input

Outputdilation = 8




Input

Outputdilation = 8




Input

Outputdilation = 8


xT~xTndash30

xT~xTndash45


Vowel (u)

Vowel (i e o)

Post-

posts

crip

t

Posts

crip

t

Pres

crip

t

Root

Superscript

Subscript


4 Complexity






CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

Xjndashn Xj+n

Attentionmechanism

Concatenation

Xj

tanh tanh


αiindashn

xindashn

Ci

lowast

αiindash1

xindash1

lowast

αii+1

xi+1

lowast

αii+n

xi+n

lowast


Complexity 5






Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh




6 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity


Ci 1113944i+n








4 Experiments









Speech wave

MFCC

WaveNet

CTC loss




Complexity 3

xT~xTndash15

Stack 3

Stack 2

Stack 1




Input

Outputdilation = 8




Input

Outputdilation = 8




Input

Outputdilation = 8


xT~xTndash30

xT~xTndash45


Vowel (u)

Vowel (i e o)

Post-

posts

crip

t

Posts

crip

t

Pres

crip

t

Root

Superscript

Subscript


4 Complexity






CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

Xjndashn Xj+n

Attentionmechanism

Concatenation

Xj

tanh tanh


αiindashn

xindashn

Ci

lowast

αiindash1

xindash1

lowast

αii+1

xi+1

lowast

αii+n

xi+n

lowast


Complexity 5






Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh




6 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity

xT~xTndash15

Stack 3

Stack 2

Stack 1




Input

Outputdilation = 8




Input

Outputdilation = 8




Input

Outputdilation = 8


xT~xTndash30

xT~xTndash45


Vowel (u)

Vowel (i e o)

Post-

posts

crip

t

Posts

crip

t

Pres

crip

t

Root

Superscript

Subscript


4 Complexity






CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

Xjndashn Xj+n

Attentionmechanism

Concatenation

Xj

tanh tanh


αiindashn

xindashn

Ci

lowast

αiindash1

xindash1

lowast

αii+1

xi+1

lowast

αii+n

xi+n

lowast


Complexity 5






Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh




6 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity






CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

Xjndashn Xj+n

Attentionmechanism

Concatenation

Xj

tanh tanh


αiindashn

xindashn

Ci

lowast

αiindash1

xindash1

lowast

αii+1

xi+1

lowast

αii+n

xi+n

lowast


Complexity 5






Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh




6 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity






Softmax


CTC loss

MFCC

Speech wave

+

1 times 1

times

Dilatedconv

tanh σ

Residual

hjhjndashn hj+n

Attentionmechanism

Concatenation

tanh tanh




6 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity



























Complexity 7










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity










Lhasa-Uuml-Tsang

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

0

5

10

15

20

25

30

Sylla

ble e

rror

rate

of t

hree

-task

mod

els o

n sp

eech

cont

ent r

ecog

nitio

n (

)

Wav

eNet

-atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TCChangdu-Kham

0

10

20

30

40

50

60

70

Wav

eNet

-Atte

ntio

n(5

)-CT

C

Wav

eNet

-Atte

ntio

n(7

)-CT

C

Wav

eNet

-Atte

ntio

n(1

0)-C

TC

Amdo pastoral

0

5

10

15

20

25

30

35


8 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity

5 Conclusions


Data Availability




















Complexity 9

Acknowledgments


References
























10 Complexity

Acknowledgments


References
























10 Complexity

Documents

MultitaskLearningwithLocalAttentionforTibetan …...2020/09/22 · diﬀerentpositionswithinWaveNet,i.e.,intheinputlayer andhighlayer,respectively. econtributionofthisworkisthree-fold.Forone,we