

2012 5th International Conference on BioMedical Engineering and Informatics (BMEI 2012)

Parallel Implementation of Neural Networks Training on Graphic Processing Unit

Yong Liu, Yeming Xiao, Li Wang, Jielin Pan, Yonghong Yan
The Key Laboratory of Speech Acoustics and Content Understanding

Institute of Acoustics, Chinese Academy of Sciences
Beijing, China

Abstract—Recently, artificial neural networks (ANNs), especially deep belief networks (DBNs), have become increasingly popular in acoustic model training. To speed up ANN training, the Graphics Processing Unit (GPU) is used. This paper presents the details of training a Back-Propagation (BP) neural network acoustic model for speech recognition on the GPU, including the application of parallel reduction and the asynchronous implementation between CPU and GPU. The GPU implementation is 26 times faster than a single-threaded Intel MKL (Math Kernel Library) implementation.

Index Terms—GPU, BP neural network, acoustic model, speech recognition

I. INTRODUCTION

After decades of research, the performance of automatic speech recognition (ASR) systems in real scenarios still lags behind human performance. Researchers have introduced a number of notable advances and new mathematical or physical models into training. For acoustic modeling, the Hidden Markov Model was adopted [6], [14] and brought a great improvement to speech recognition. Later, minimum classification error (MCE) [4], [7] and minimum phone error (MPE) [13], [12] training were applied to the acoustic model. Meanwhile, the language model was also refined, for example with n-gram back-off word models [5], maximum entropy language models [15], and so on. However, the results of speech recognition are still not satisfactory. The main challenge of ASR is the complexity and variability of speech, so no mathematical or physical model, such as the Hidden Markov Model with Gaussian Mixture Models (HMM-GMM) or the Boltzmann Machine [1], is able to represent real speech perfectly.

Recently, deep belief networks have been successfully introduced into acoustic modeling, and the resulting recognition network is called a deep neural network (DNN). In conventional ASR systems, the posterior probability P(state|x) of each state comes from the GMM-HMM [19], [14], where x is the input pattern. In a DNN system, the outputs of the neural network are the states' posterior probabilities. [9] shows that the neural network clearly reduces the phone error rate compared with the GMM.

Unfortunately, neural network training is time-consuming. [16] shows that a 4-layer network with about 1M weights, trained on 150 to 500 hours of speech, takes 42 hours and 35 minutes. If the training data is doubled or tripled and the network becomes more complex, the training time will be unacceptable.

Therefore, many researchers use the GPU when training sophisticated neural networks. In our experiments, we achieve a 26-times speedup on a Tesla C2050, which makes the time cost acceptable. According to the recognition results, the neural network acoustic model is practical.

This paper is organized as follows. Section II compares the architectures of the CPU and the GPU and gives the time cost of matrix multiplication on each. Section III gives the details of implementing the BP algorithm, including parallel reduction and the storage of matrices on the GPU. Section IV gives the experimental results. Concluding remarks and future directions are reported in Section V.

II. WHY USE THE GPU IN TRAINING?

A. Time cost in training

In our training experiments, matrix multiplication takes close to 71.5% of the time on the CPU, which means it is the bottleneck of training, so we must do our best to speed it up. We can use an optimized matrix library such as Eigen [2] or Intel MKL. Careful programming with the Streaming SIMD Extensions (SSE), as in [18], is also a good choice.

Another option is to improve the hardware, with a faster CPU or with a GPU. We ran a small test of 100 multiplications between two 3000 × 3000 matrices. The GPU, using the CUDAMat library [11], takes 9.03 s, while the CPU takes 335.48 s with single-threaded MKL; the speed difference is close to 37 times.
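The benchmark code itself is not listed in the paper. The following is only a minimal sketch of how such a GPU matrix-multiplication timing could be done, assuming cuBLAS (rather than CUDAMat) and CUDA events; the matrix contents are placeholders:

    // matmul_bench.cu -- hedged sketch: time 100 SGEMMs of 3000x3000 matrices on the GPU
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int N = 3000, reps = 100;
        std::vector<float> h(N * N, 1.0f);               // placeholder data
        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(float) * N * N);
        cudaMalloc(&dB, sizeof(float) * N * N);
        cudaMalloc(&dC, sizeof(float) * N * N);
        cudaMemcpy(dA, h.data(), sizeof(float) * N * N, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, h.data(), sizeof(float) * N * N, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)                   // C = A * B, repeated 100 times
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                        &alpha, dA, N, dB, N, &beta, dC, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("100 SGEMMs: %.2f s\n", ms / 1000.0f);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }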

Fig. 1. The different structures of the CPU and the GPU.


B. The structures of CPU and GPU

The architectural dissimilarity between the CPU and the GPU leads to different computational properties. Figure 1, which comes from the NVIDIA documentation, shows that the GPU has more Arithmetic Logic Units (ALUs) and fewer control units than the CPU. The CPU fetches an instruction and its data and then executes the instruction; its pipeline fetches the next instruction and data while the previous instruction is being executed. The GPU loads one instruction and thousands of data items at a time and then executes the same instruction over these data in a batch, so computations with few logical operations fit the GPU well. From Figure 1 we can conclude that the GPU is suited to dense, highly parallel computation problems.

Because of the structure of neural networks, the nodes within a layer are independent of one another. The CPU has a clear advantage in executing sequential instructions: if the CPU is used for training, all instructions are performed in its pipeline. If, instead, all the nodes of the same layer are updated in parallel, the time is reduced considerably. Together with the previous paragraph, this is why we hold that the GPU has a clear advantage over the CPU for neural network computation.

III. THE TRAINING OF NEURAL NETWORKS

This section gives the details of BP neural network training. BP is one of the fine-tuning techniques used in DBNs; pre-training is not considered in this paper.

A. Back-Propagation algorithm

In our system, the nodes of the output layer correspond to the states of the HMM, which means the outputs are probability values. First, the targets for the input patterns are obtained by forced alignment, performed off-line with the Viterbi algorithm. We then minimize the mean square error between the teacher signal (the result of the forced alignment) and the output of the neural network.

The corresponding nodes give the output for the input vector $X_t = (x_1, x_2, \cdots, x_n)$ as:

$$ net_i = b_i + \sum_{j=1}^{n} w_{ij} \cdot x_j \qquad (1) $$

$$ out_i = f(net_i) \qquad (2) $$

where $n$ is the dimension of the input vector $X_t$, and $f(net_i)$ is the sigmoid function for the intermediate layers and the softmax function for the output layer. For an array $x[1], \ldots, x[n]$, the softmax function we use is:

$$ softmax(x[i]) = \frac{\exp(x[i] - maxValue)}{\sum_{j=1}^{n} \exp(x[j] - maxValue)} \qquad (3) $$

where $maxValue$ is the maximum of $x[1], \ldots, x[n]$. Equation 3 differs slightly from the standard softmax formulation because of $maxValue$: subtracting the maximum keeps $\exp(x[i])$ from overflowing, and it has no effect on the result.
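As a reference for Equation 3, a minimal single-threaded sketch of the max-subtracted softmax for one row might look as follows (the function and variable names are illustrative, not taken from the paper):

    /* Hedged sketch: numerically stable softmax for one row of length n. */
    #include <math.h>

    void softmax_row(const float *x, float *out, int n) {
        float maxValue = x[0];
        for (int i = 1; i < n; ++i)            /* max of the row */
            if (x[i] > maxValue) maxValue = x[i];

        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {          /* exp(x - max) and its sum */
            out[i] = expf(x[i] - maxValue);
            sum += out[i];
        }
        for (int i = 0; i < n; ++i)            /* normalize */
            out[i] /= sum;
    }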

The sigmoid function is defined as:

$$ sigmoid(t) = \frac{1}{1 + \exp(-t)} \qquad (4) $$

The matrix form of Equation 1 for a single frame is:

$$ Net = (W \cdot X)^{T} + B \qquad (5) $$

where $W \cdot X$ is an $n \times 1$ matrix and $B$ is an $n$-dimensional vector. The addition means that $B$ is added element-wise to $W \cdot X$, and $B$ must be replicated (broadcast) across frames when the input contains multiple frames.
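The following is a minimal single-threaded reference sketch of Equations 1, 2 and 5 for a bunch of frames; it is only meant to pin down the shapes and the bias broadcast, not to represent the GPU implementation (names and memory layout are our own assumptions):

    /* Hedged sketch: forward pass of one layer for Fr frames.
     * in  : Fr x nIn   (row-major, one frame per row)
     * W   : nIn x nOut
     * B   : nOut
     * out : Fr x nOut, out = f(in * W + B), with B broadcast to every frame. */
    #include <math.h>

    void forward_layer(const float *in, const float *W, const float *B,
                       float *out, int Fr, int nIn, int nOut, int use_sigmoid) {
        for (int f = 0; f < Fr; ++f) {
            for (int o = 0; o < nOut; ++o) {
                float net = B[o];                       /* bias term b_i */
                for (int i = 0; i < nIn; ++i)           /* sum_j w_ij * x_j */
                    net += in[f * nIn + i] * W[i * nOut + o];
                out[f * nOut + o] = use_sigmoid ? 1.0f / (1.0f + expf(-net)) : net;
            }
            /* for the output layer, softmax_row(out + f * nOut, out + f * nOut, nOut)
               from the previous sketch would be applied instead of the sigmoid */
        }
    }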

The backward process updates the weights. The objective error function is:

$$ E = \frac{1}{2} \sum_{k=1}^{c} (t_k - o_k)^2 = \frac{1}{2} \| t - o \|^2 \qquad (6) $$

According to the gradient descent algorithm, the formulas in Table I can be derived.

In our experiments, we load a large number of frames into memory at one time (called a cache), and the weights are updated after every batch of frames (called a bunch). Table I shows the forward and backward formulas in detail. The parameter Fr is the number of frames handled at one time, and T is the teacher signal, a Fr × NOUT matrix with only one non-zero value in every row.

B. Parallel reduction

Fig. 2. The general reduction algorithm for sum.

Fig. 3. The sum reduction algorithm for GPU.

Thanks to the CUDAMat library, most of the training is straightforward to implement; the exception is the softmax function, which must compute the max and the sum of every row of a matrix.


TABLE I
The BP algorithm.

|            | Input | Hid1                                                       | Hid2                                                     | Output                          |
|------------|-------|------------------------------------------------------------|----------------------------------------------------------|---------------------------------|
| Func-Type  |       | sigmoid                                                    | sigmoid                                                  | softmax                         |
| Node-Num   | NIN   | NHID1                                                      | NHID2                                                    | NOUT                            |
| Input-Dim  |       | Fr × NIN                                                   | Fr × NHID1                                               | Fr × NHID2                      |
| Weight     |       | NIN × NHID1                                                | NHID1 × NHID2                                            | NHID2 × NOUT                    |
| Bias       |       | NHID1                                                      | NHID2                                                    | NOUT                            |
| Output-Dim |       | Fr × NHID1                                                 | Fr × NHID2                                               | Fr × NOUT                       |
| OUT        |       | Sigmoid(W_HID1^T · IN_HID1^T + B)                          | Sigmoid(W_HID2^T · IN_HID2^T + B)                        | Softmax(W_OUT^T · IN_OUT^T + B) |
| δ dim      |       | Fr × NHID1                                                 | Fr × NHID2                                               | Fr × NOUT                       |
| δ          |       | δ_HID1 = [OUT_HID1 ∗ (1 − OUT_HID1)] ∗ [W_HID2^T · δ_HID2] | δ_HID2 = [OUT_HID2 ∗ (1 − OUT_HID2)] ∗ [W_OUT^T · δ_OUT] | δ_OUT = (T_k − OUT_k)           |
| ∆W dim     |       | NIN × NHID1                                                | NHID1 × NHID2                                            | NHID2 × NOUT                    |
| ∆W         |       | ∆W = η ∗ δ_HID1^T ∗ IN_HID1                                | ∆W = η ∗ δ_HID2^T ∗ IN_HID2                              | ∆W = η ∗ δ_OUT^T ∗ IN_OUT       |
| W_new      |       | W_new = W_old − ∆W (for each layer)                        |                                                          |                                 |
| ∆B         |       | sum of columns of δ_HID1                                   | sum of columns of δ_HID2                                 | sum of columns of δ_OUT         |
| B_new      |       | B_new = B_old − ∆B (for each layer)                        |                                                          |                                 |
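To make the backward formulas of Table I concrete, here is a minimal single-threaded sketch of the delta computation for the output layer and one hidden layer, plus the output-layer weight and bias update, with dimension-consistent row-major matrices. It is a reference illustration under our own naming assumptions, not the GPU code of the paper, and it follows the standard sign convention in which δ = target − output and the update adds η ∗ IN^T ∗ δ:

    /* Hedged reference sketch of the backward step in Table I for the top two layers.
     *   out_out : Fr x nOut  softmax outputs      T     : Fr x nOut one-hot targets
     *   out_hid : Fr x nHid  outputs of the last hidden layer (= inputs of the output layer)
     *   W_out   : nHid x nOut                     B_out : nOut                              */
    void backward_step(const float *out_out, const float *T, const float *out_hid,
                       float *W_out, float *B_out,
                       float *delta_out, float *delta_hid,
                       int Fr, int nHid, int nOut, float eta) {
        /* delta_OUT = T - OUT  (Fr x nOut) */
        for (int f = 0; f < Fr; ++f)
            for (int o = 0; o < nOut; ++o)
                delta_out[f * nOut + o] = T[f * nOut + o] - out_out[f * nOut + o];

        /* delta_HID = OUT_HID * (1 - OUT_HID) * back-propagated delta  (Fr x nHid) */
        for (int f = 0; f < Fr; ++f)
            for (int h = 0; h < nHid; ++h) {
                float back = 0.0f;
                for (int o = 0; o < nOut; ++o)
                    back += delta_out[f * nOut + o] * W_out[h * nOut + o];
                float y = out_hid[f * nHid + h];
                delta_hid[f * nHid + h] = y * (1.0f - y) * back;
            }

        /* W_out update: eta * (inputs of this layer)^T * delta_out */
        for (int h = 0; h < nHid; ++h)
            for (int o = 0; o < nOut; ++o) {
                float g = 0.0f;
                for (int f = 0; f < Fr; ++f)
                    g += out_hid[f * nHid + h] * delta_out[f * nOut + o];
                W_out[h * nOut + o] += eta * g;
            }

        /* B_out update: eta * column sums of delta_out */
        for (int o = 0; o < nOut; ++o) {
            float g = 0.0f;
            for (int f = 0; f < Fr; ++f)
                g += delta_out[f * nOut + o];
            B_out[o] += eta * g;
        }

        /* the hidden-layer weights and bias are updated in exactly the same way,
           using the hidden layer's own inputs and delta_hid */
    }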

However, computing the row-wise max and sum sequentially is not simple to do efficiently within the GPU framework. Of course, we could run ordinary sequential instructions on the GPU, but that would be inefficient, so we implement the general parallel reduction algorithm in the softmax function.

Figure 2 shows the general reduction for sum; the basic principle is that different threads deal with different elements. If n threads execute the algorithm, the time complexity is O(log2 n) rather than O(n).

Considering the thread-block model [10] of the GPU, we can optimize the implementation of sum and max. The maximum number of threads in a block is 2^n (where n depends on the type of GPU), so we distribute several blocks to compute the sum of a row. For an m × k matrix, Figure 3 shows how blocks and threads are assigned. After the step in Figure 3, block(i,1) (i = 1, 2, ..., m) holds the partial sums of the i-th row; it then obtains the final sum of the i-th row in its first element using the algorithm in Figure 2. In general, at least two blocks are used to handle one row (the number of threads in a block depends on the configuration, and the number of elements handled per block lies between 2^(i-1) and 2^i). Only when the number of columns is a power of 2 and no larger than the maximum number of threads in a block can a single block be used.
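The paper does not reproduce its kernel code. As an illustration of the shared-memory tree reduction it describes, a simplified CUDA kernel that sums each row with one block per row (rather than the multi-block-per-row assignment of Figure 3) might look like this; the kernel and variable names are our own:

    /* Hedged sketch: row-wise sum by parallel reduction, one block per row.
     * mat is m x k (row-major); rowSum receives one value per row.
     * blockDim.x must be a power of 2 for the tree reduction below. */
    __global__ void rowSumReduce(const float *mat, float *rowSum, int k) {
        extern __shared__ float sdata[];
        const int row = blockIdx.x;
        const int tid = threadIdx.x;

        /* each thread first accumulates a strided slice of its row */
        float v = 0.0f;
        for (int j = tid; j < k; j += blockDim.x)
            v += mat[row * k + j];
        sdata[tid] = v;
        __syncthreads();

        /* tree reduction in shared memory: O(log2 n) steps for n threads */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            rowSum[row] = sdata[0];
    }

    /* launch example: one block of 256 threads per row of an m x k matrix
       rowSumReduce<<<m, 256, 256 * sizeof(float)>>>(d_mat, d_rowSum, k);   */

The row-wise max needed by Equation 3 follows the same pattern with fmaxf in place of the addition.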

C. Asynchronous implementation

Because the CPU and the GPU are independent, they can execute different instructions in parallel: the GPU focuses on the heavy computation while the CPU handles the logic instructions. The CPU can also take care of data preparation; the signal processing of the raw speech features and the random reordering within a cache can be carried out asynchronously by creating a new CPU thread for data preparation.
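A minimal sketch of this overlap, assuming a double-buffered cache and C++11 threads (the actual threading code is not given in the paper, and prepare_cache / train_cache_on_gpu are placeholders), could look as follows:

    /* Hedged sketch: prepare the next cache on a CPU thread while the GPU
     * trains on the current one. */
    #include <thread>
    #include <vector>

    /* placeholder: signal processing of raw features and random reorder within the cache */
    void prepare_cache(std::vector<float> &cache) { (void)cache; }
    /* placeholder: forward/backward BP over the bunches of this cache on the GPU */
    void train_cache_on_gpu(const std::vector<float> &cache) { (void)cache; }

    void train_epoch(int nCaches, size_t cacheSize) {
        std::vector<float> bufA(cacheSize), bufB(cacheSize);
        prepare_cache(bufA);                               /* fill the first cache */
        for (int c = 0; c < nCaches; ++c) {
            std::vector<float> &cur  = (c % 2 == 0) ? bufA : bufB;
            std::vector<float> &next = (c % 2 == 0) ? bufB : bufA;
            /* CPU thread prepares the next cache ... */
            std::thread prep;
            if (c + 1 < nCaches)
                prep = std::thread(prepare_cache, std::ref(next));
            /* ... while the GPU trains on the current cache */
            train_cache_on_gpu(cur);
            if (prep.joinable())
                prep.join();
        }
    }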

IV. EXPERIMENTS

In our system, the output layer consists of the 1329 states of the HMM and the input layer has 462 nodes (11 consecutive frames). Each of the two hidden layers has 2048 nodes. The training set is 3,237,888 frames (about 9 hours of data) and the test set is 1543 utterances (about 1 hour). The machine is configured with an Intel(R) Xeon(R) X5650 CPU and an NVIDIA Tesla C2050 GPU running Windows Server 2008. The recognition decoder is a one-pass decoder [17].

A. Parallel reduction for softmax

Section III-B gave the details of the parallel reduction algorithm. For a 1024 × 1329 matrix, we compare the time costs of the CPU implementation, the normal GPU code, and the parallel reduction when the softmax is executed ten times. In Table II, "Normal" means that a single thread serially handles one row with a loop.

TABLE II
The comparison of softmax.

|           | Max     | Sum     | Total Time |
|-----------|---------|---------|------------|
| CPU       | 24 ms   | 124 ms  | 217 ms     |
| Normal    | 23.5 ms | 23.4 ms | 63 ms      |
| Reduction | 3.5 ms  | 2.9 ms  | 16 ms      |

Table II shows that the reduction is an effective replacement for the sequential algorithm on the GPU. Parallel reduction is also a good choice when computing the global training error and the frame accuracy rate.

B. Neural network training and speech recognition

In the training, a cache contains 20,480 frames, i.e., 20,480 frames are loaded at one time, and the weights are updated after every 1024 frames (the bunch size).


Fig. 4. The training and recognition result.

Table III shows the training time comparison between the CPU and the GPU for one iteration.

TABLE III
The comparison of training time.

|                | Time cost  | Speedup  |
|----------------|------------|----------|
| CPU            | 140 min    | baseline |
| multi-core CPU | 35 min     | 4×       |
| GPU            | 5 min 18 s | 26×      |

Besides this, we ran a speech recognition experiment. Figure 4 shows the frame accuracy during training and the recognition error rate over 20 iterations. The curve with the ∆ label is the frame accuracy in training, which increases from 21% to 56%. The curve with the × label is the recognition error rate, which decreases from 72.4% to 60%. The global training error also falls from 1805 to 1172 (the global error is not shown in the figure; it only indicates the tendency of the training). Because of overfitting to the training set, the recognition error stays around 60% after the 11th iteration.

V. CONCLUSION AND FUTURE WORK

Table III shows that the GPU accelerates the training of neural networks by 26 times compared with the Intel MKL library, an obvious improvement for neural network training. This means we can use more data and more complex networks, measures that are likely to improve recognition performance. Figure 4 displays the effect of the iterations; the model overfits after the 11th iteration because the training set is not very large.

Our future work on neural networks is as follows. Dahl et al. [3] report experiments with GMM and DBN acoustic models and show that the DBN performs better than the GMM. Following this training-acceleration experiment, we will use more training patterns and more complex networks to test the performance of neural networks for Chinese speech recognition. Meanwhile, pre-training of the DBN should bring a small improvement in recognition accuracy, and we will carry out the pre-training experiment. Besides that, it is possible to implement the training of a recurrent neural network based language model [8] on the GPU, since language model training is a challenge for the CPU. Furthermore, decoding on the GPU is also a trend; with great enhancements in decoding and training, real-time speech recognition is likely to become a reality.

ACKNOWLEDGMENT

This work is partially supported by the National Natural Science Foundation of China (Nos. 10925419, 90920302, 61072124, 11074275, 11161140319) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA06030100, XDA06030500).

REFERENCES

[1] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169, 1985.

[2] B. Jacob and G. Guennebaud. Eigen v3, 2010.

[3] G.E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

[4] B.H. Juang, W. Hou, and C.H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257-265, 1997.

[5] S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401, 1987.

[6] K.F. Lee. On large-vocabulary speaker-independent continuous speech recognition. Speech Communication, 7(4):375-379, 1988.

[7] E. McDermott, T.J. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri. Discriminative training for large-vocabulary speech recognition using minimum classification error. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):203-223, 2007.

[8] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), 2010.

[9] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, (99):1-1, 2010.

[10] NVIDIA. CUDA programming guide, 2012.

[11] NVIDIA. CUDA toolkit, 2012.

[12] D. Povey. Discriminative training for large vocabulary speech recognition. Cambridge, UK: Cambridge University, 2004.

[13] D. Povey and P.C. Woodland. Minimum phone error and I-smoothing for improved discriminative training. Volume 1, pages I-105-I-108. IEEE, 2002.

[14] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[15] R. Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 10(3):187, 1996.

[16] S. Scanzio, S. Cumani, R. Gemello, F. Mana, and P. Laface. Parallel implementation of artificial neural network training. Pages 4902-4905. IEEE, 2010.

[17] J. Shao, T. Li, Q. Zhang, Q. Zhao, and Y. Yan. A one-pass real-time decoder using memory-efficient state network. IEICE Transactions on Information and Systems, 91(3):529-537, 2008.

[18] V. Vanhoucke, A. Senior, and M.Z. Mao. Improving the speed of neural networks on CPUs, 2011.

[19] S. Young. A review of large vocabulary continuous speech recognition. IEEE Signal Processing Magazine, 13(5):45, 1996.