
Free decoding parameter optimization for automatic speech recognition

Rheinisch-Westfälische Technische Hochschule Aachen
Fraunhofer Institute for Intelligent Analysis and Information Systems

    Master Thesis

    LE NGUYEN THACH

    MATR.-NR. 328585

    Examiners:

Prof. Dr.-Ing. Christian Bauckhage
Prof. Dr.-Ing. Mauro Brunato

Advisor: Dr.-Ing. Daniel Stein

Registration date: 24. January 2014
Submission date: . March 2014


I hereby affirm that I composed this work independently and used no other than the specified sources and tools and that I marked all quotes as such.

Ich erkläre hiermit, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Aachen, den . März 2014


    Abstract

This thesis addresses the parameter optimization problem for Automatic Speech Recognition (ASR) by investigating various optimization techniques including Simultaneous Perturbation Stochastic Approximation, Simulated Annealing, Downhill Simplex, Evolution Strategies and Gradient Descent. All are well-known gradient-free algorithms for solving optimization problems. Although studies on optimization algorithms have a long history, only a small number of research works have taken this approach for ASR parameter optimization. Moreover, all related scientific contributions are independent works and no thorough comparison has been offered. For this reason, this thesis aims to fill the void by comparing the above-mentioned algorithms.

Numerous experiments were conducted to reach this goal. The first set of experiments employed several standard functions in order to assess the performance of the algorithm implementations in standard optimization tests. The second set of experiments was carried out on the ASR system with different performance targets and different decoding paradigms. This thesis suggests several objective functions for this purpose. The recognition performance is measured in terms of Word Error Rate (WER) and Real Time Factor (RTF). Two different decoders (Julius and Kaldi) and five different test sets (one development set and four evaluation sets) were employed for the experiments. The algorithm comparison focuses on solution quality and optimization cost.


    Acknowledgments

I would like to express my deep gratitude to Dr. Daniel Stein, who has been directly responsible for everything from the beginning. He granted me a student job at Fraunhofer IAIS, he guided my work, he patiently listened to my broken English, he inspired me. And I must say that these are only a tenth of what he did for me.

My sincerest thanks go to Professor Christian Bauckhage and Professor Mauro Brunato for being willing to be my thesis supervisors. Their kind support has been fundamental for my thesis.

I would also like to thank Michael Stadtschnitzer, whom I bothered a lot while working on my thesis. Thank you for always being nice and helpful to me. Michael was also responsible for the Kaldi-related experiments in this thesis.

Many thanks to Dr. Rolf Bardeli for helping me when Daniel was away. The Speech Recognition Lab and the Multimedia Analysis and Retrieval Lab are easily my favourites thanks to your and Daniel's teaching.

Special thanks to Professor Maurizio Marchese and Dr. Elena Bortolotti from the University of Trento, and Dr. Juergen Rapp from RWTH Aachen University, who are the coordinators of the European Master in Informatics programme. They have done a great job of taking care of the students, especially foreign students who arrived in Europe for the first time. I have only positive comments for their efforts.

Cheers to all of my friends and colleagues in Italy and Germany. Thank you for putting up with my annoying personality and giving me wonderful memories in Europe.

Special thanks to Dr. Tran Vu Pham and Dr. Nam Thoai from Ho Chi Minh City University of Technology. Their support is the reason that I could go to Europe for my studies in the first place.

Finally, my sincere thanks to my beloved family for being supportive when I was struggling with everything. I cannot find enough words to thank them.


    Contents

1 Introduction

2 Related work

3 Automatic Speech Recognition

4 Gradient-Free Optimization
4.1 Simultaneous Perturbation Stochastic Approximation
4.2 Simulated Annealing
4.3 Downhill Simplex
4.4 Evolution Strategies
4.5 Gradient Descent
4.6 Summary

5 Evaluation
5.1 Benchmark
5.2 Automatic Speech Recognition Task
5.2.1 Performance Metric
5.2.2 Objective Function
5.2.3 Experimental Set-up
5.2.4 Unconstrained Optimization
5.2.5 Time-constrained Optimization

6 Conclusion

Bibliography

A Appendix


1 Introduction

Automatic speech recognition (ASR) describes the type of application that can automatically transcribe speech into text. The promise of ASR technology has seen it applied in a large number of domains, from household appliances such as video game consoles to more complicated systems such as military aircraft. A typical ASR system employs a set of decoding parameters which are free to modify. Some examples of these parameters are the language model weight, the insertion penalty, the beam width and the stack size, which can be found in Julius, an open-source speech recognizer [16]. A challenge in the automatic speech recognition field is how to find a set of optimal or near-optimal parameters that can boost the performance of the system, accuracy-wise or speed-wise. This thesis aims to tackle this problem.

Generally, the interplay between several factors in a speech recognition system, such as the acoustic model, the language model, the search algorithm or the system platform, implies that there is no universal optimal parameter set for a decoder. In practice, the parameters are often defined empirically. Such an approach is not effective due to the number and range of the parameters and the difficulty of learning how exactly these parameters affect the system. An automatic method to find the optimal configuration is thus a more reasonable choice. This method should be as universal as possible, i.e., it should be easy to deploy in different kinds of automatic speech recognition systems without much manual modification.

In fact, this can theoretically be done with already well-known optimization algorithms such as Simulated Annealing [15] or Evolution Strategies [7].


The parameter tuning problem generally exists in many sorts of systems and there are numerous studies showing that it can be solved by optimization algorithms [12, 21, 22]; hence automatic speech recognition systems should be no exception. In this case, the decoder can be modelled as the optimization-target function with the decoding parameters as input. Typically, one only needs to design the objective function (in other literature, it may be referred to as the fitness function or loss function), which scores how good the decoding output is, and a starting point or guess solution, which can be the default configuration of the decoder. Obviously, since the concrete influence of the decoding parameters on the speech recognition process is usually unknown or impractical to learn, the objective function is a black-box function, which implies that any optimization technique relying on direct gradient measurement is not applicable. Furthermore, when processing time is considered in the objective function, the return value becomes inconsistent, since how long a process takes to finish depends heavily on the current state of the system. This is called noisy measurement of the objective function, and the chosen optimization techniques are required to handle this noise efficiently. In addition, the speech decoding task is usually an expensive one, therefore the convergence speed of the optimization process is also a highly important criterion.

With all the above issues in mind, several gradient-free optimization techniques are investigated in this work: Downhill Simplex [23], Simulated Annealing [15], Evolution Strategies [7] and Simultaneous Perturbation Stochastic Approximation [32]. An addition is the Gradient Descent-based optimization framework described by El Hannani and Hain [8] (for convenience, this approach will be referred to as Gradient Descent in this thesis), which is actually designed for the time-constrained speech recognition task. The Evolution Strategies and Simultaneous Perturbation Stochastic Approximation algorithms have already been tested on automatic speech recognition tasks [13, 35], but these are all independent works and none has attempted a direct comparison between them. Therefore, it is unclear which method is better for the ASR parameter optimization task.

This thesis aims to find the answer by comparing the algorithms in the same context, i.e., the same system and the same test sets. Consequently, the experiments are conducted to learn how these optimization methods perform with different goals (i.e. accuracy improvement and speed improvement) and on different decoding paradigms. The investigation focuses on the quality of the solutions found by the optimization algorithms and the number of function evaluations needed to reach these solutions. This is based on our argument that typical users want a quick and high-quality solution from the optimization process.

    This thesis is organized as follows:


Chapter 2 discusses related work. This includes some successful approaches to boosting speech recognizer performance. Among them are some recent works that take the approach of employing optimization algorithms. These studies show that gradient-free optimization algorithms can indeed improve speech decoding performance in terms of speed and accuracy. In addition, this Chapter also discusses some works comparing different optimization algorithms.

Chapter 3 provides a brief review of automatic speech recognition. A conceptual automatic speech recognition system is described, followed by the two speech decoder implementations used in this work.

Chapter 4 introduces the gradient-free optimization techniques. Five methods, namely Simultaneous Perturbation Stochastic Approximation (SPSA), Simulated Annealing (SA), Downhill Simplex (DS), Evolution Strategies (ES) and Gradient Descent (GD), are described.

Chapter 5 presents the experiments conducted for this thesis. This includes the experimental set-up, the optimization targets, the objective functions, and finally the optimization results and their evaluation. The content of this Chapter is divided into two sections: the benchmark experiments section and the ASR experiments section. The first presents the standard optimization tests, while the latter focuses on unconstrained and time-constrained experiments on the ASR systems.

Chapter 6 draws the conclusion of this thesis.


2 Related work

Parameter optimization works related to the automatic speech recognition task include [13], in which Kacur and Korosi employed Evolution Strategies to find the optimal parameter configuration for an ASR dialogue system. In the paper, the authors also describe a number of Evolution Strategies that can be employed for the task. The strategies include the choices of mutation method and recombination method, as well as the covariance matrix adaptation approach. However, it is unclear which exact strategy was used in their experiments. In addition, the work only focuses on optimizing the Word Error Rate and disregards the decoding time. Information on the optimization cost is also missing.

In contrast, an ASR optimization framework based on the Gradient Descent method proposed by El Hannani and Hain [8] has been employed to find the optimal configuration under a time constraint. This method constructs an optimal performance curve of the decoding system, from which the user can obtain an optimal parameter set given a speed requirement.

Simultaneous Perturbation Stochastic Approximation (SPSA) experiments with automatic speech recognition were done by Stein et al. [35] in both time-constrained and unconstrained settings. The authors also suggested two time-constrained methods, namely delta and increasing, to linearly penalize slow decoding. In general, the results showed that the SPSA method is suitable for the decoding parameter optimization task, with remarkable improvements in performance. The empirical data implies that the delta method can enhance the accuracy while maintaining the decoding time, while the increasing method successfully increases the speed of decoding, although at the price of some accuracy degradation.


However, since both methods employ the same decoding time threshold, it is unclear from the results how this threshold impacts the delta approach.

Despite employing different methods and different systems, all of these works concluded that optimization algorithms can indeed considerably improve the speech decoding output in terms of accuracy or decoding speed. Some of them claimed that their respective approach provides a better alternative than grid search or random search, although without empirical evidence. Nonetheless, since they are isolated works, none of them directly compares its results with the others. Such a comparison is possible because all of the above-mentioned optimization frameworks have a generic nature, i.e., they can be applied easily to different speech recognizers.

A different approach is taken by Mak and Ko [19], in which a method called iterative linear programming was employed. However, this method aims at optimizing specific coefficients such as the language model weight and the insertion penalty, hence it is incapable of tuning arbitrary parameters. The evaluation compares the result with the solution provided by grid search and claims that the method can find a better solution with little training data on the development set, and achieves comparable performance on the evaluation set.

By all means, the performance of an automatic speech recognizer can be improved not just by tuning the parameters. However, this is not the main focus of this thesis, thus only a few examples of such research are offered as follows. Alleva et al. [2] described a three-phased search algorithm to improve decoding accuracy with minimal additional computation cost. Rogers [28] considered parallelism to utilize shared-memory multiprocessors to speed up the decoding process. The resulting parallel system can run six times faster than the original sequential speech recognizer. Chan et al. [5] studied a set of fast Gaussian Mixture Model (GMM) computation techniques in a large vocabulary continuous speech recognition (LVCSR) system. The study proposed a conceptual scheme in which the techniques are categorized into four groups associated with the layers of GMM computation. The aim of the study is to achieve real-time speech recognition (i.e. a recognition time shorter than the duration of the input speech).


In a comparison study of stochastic optimization algorithms, Spall framed the efficiency question as follows: with some high probability $1-\rho$ (where $\rho$ is a small number), how many $L(\cdot)$ function evaluations, say $n$, are needed to achieve a solution lying in some satisfactory set $S(\theta^*)$ containing the optimum $\theta^*$?

From this point of view, the number of loss evaluations $n$ is calculated theoretically for each algorithm in spaces of different dimension (particularly $p = 1, 2, 5$ and $10$). The results show that SPSA has a very competitive performance and tops the table in 3 out of 4 cases. However, the author also reminds us that the results may not reflect the efficiency of the algorithms in practice, since they are always adjusted before being applied to a new problem.

Rios and Sahinidis [27] provided a review of derivative-free algorithms and a comparison of software implementations. The authors tested 22 solvers on a test bed that includes convex and non-convex, smooth and non-smooth problems. The total number of problems is 502, which were expanded into 112,448 problem instances. From the results, the authors discussed the quality of the solvers regarding their ability to find global or near-global optima, the quality of the solutions given a limited number of objective function calls, the convergence speed and the success rate.


3 Automatic Speech Recognition

This Chapter aims to briefly introduce automatic speech recognition at a level accessible to readers with no prior knowledge of the field. A typical automatic speech recognition system accepts a speech signal as input and produces a written text of what was spoken. The conceptual system depicted in Figure 3.1 includes four components, namely Signal Analysis, Acoustic Model, Language Model and Global Search.

The input signal is digitally sampled from an audio waveform (i.e. an analog signal). The signal is often represented as a vector of real numbers.

The speech recognizer output, i.e., the hypothesis, is a recognized word sequence $w_1^N$ which maximizes the posterior probability $p(w_1^N \mid x_1^T)$:

$$[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} p(w_1^N \mid x_1^T) \quad (3.1)$$

where $x_1^T$ is a sequence of acoustic observations.

Typically, this probability distribution is unknown and is therefore substituted by a product of the prior and the likelihood following Bayes' rule (Eq. 3.2). Note that the denominator is removed in Eq. 3.3 because it is independent of the word sequence $w_1^N$:

$$p(w_1^N \mid x_1^T) = \frac{p(w_1^N)\, p(x_1^T \mid w_1^N)}{p(x_1^T)} \quad (3.2)$$


$$[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \left\{ p(w_1^N)\, p(x_1^T \mid w_1^N) \right\} \quad (3.3)$$

Consequently, the output is determined by maximizing the product of the two probability models: the language model $p(w_1^N)$ and the acoustic model $p(x_1^T \mid w_1^N)$, as in Eq. 3.3. However, the number of possible word sequences can grow exponentially, so it is impractical to evaluate every one of them. Therefore, the Global Search component is needed to decrease the number of candidates; it discards any word sequence that is unlikely to be the answer.
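To make the decision rule in Eq. 3.3 concrete, the following minimal Python sketch rescores a hypothetical n-best list by combining log-domain acoustic and language model scores. The hypothesis list, the scores and the scaling factor `lm_weight` are illustrative assumptions, not output of an actual decoder.

```python
# Hypothetical n-best list: each entry carries log p(x|w) and log p(w).
# In a real decoder these scores come from the acoustic and language models.
nbest = [
    {"words": "the cat sat", "log_am": -120.3, "log_lm": -14.2},
    {"words": "the cut sat", "log_am": -118.9, "log_lm": -19.7},
    {"words": "a cat sat",   "log_am": -121.5, "log_lm": -13.8},
]

lm_weight = 10.0  # language model scaling factor (itself a free decoding parameter)

def combined_score(hyp):
    # Log of the quantity maximized in Eq. 3.3, with the usual LM scaling applied.
    return hyp["log_am"] + lm_weight * hyp["log_lm"]

best = max(nbest, key=combined_score)
print(best["words"], combined_score(best))
```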

Figure 3.1 Basic architecture of a statistical automatic speech recognition system, taken from [18].

    Signal Analysis

The signal analysis component processes the input signal and produces an acoustic feature vector which can be used for speech recognition.


The input signal is multiplied with a sliding window to create a set of overlapping segments. The Fourier transform is then applied to these segments to yield the feature vector of the input signal. In practice, several more steps are employed to produce the final acoustic feature vector. Common techniques in signal analysis are based on Mel frequency cepstral coefficients (MFCC) [6] or perceptual linear prediction (PLP) [10].

    Acoustic Model

The acoustic model captures the probability of observing an acoustic vector sequence $x_1^T$ given a word sequence $w_1^N$. Typically, an acoustic model relies on a phoneme inventory and a pronunciation lexicon.

A phoneme (or phone) is the smallest unit of speech which distinguishes the utterances of a given language. For example, /l/ in kill and /s/ in kiss are two phonemes that distinguish kill and kiss.

The phoneme inventory typically stores statistical representations of phonemes using Hidden Markov Models (HMMs) [3] for each n-phone (an n-tuple of context-dependent phonemes). The speech recognizer matches HMMs against the acoustic vectors to generate a sequence of phonemes, and then looks up the corresponding words in the pronunciation lexicon.

An acoustic model is created by training on a speech corpus (audio files and transcriptions).

    Language Model

The language model captures the prior probability $p(w_1^N)$. The model represents characteristics of the language such as syntax or semantics. An m-gram language model is widely used in large vocabulary recognition; it assumes that the probability of recognizing a word $w_n$ only depends on its history of length $m-1$. The history of a word is the sequence of words that come before it in the context. This assumption leads to an $(m-1)$-th order Markov model (Eq. 3.4).

$$p(w_1^N) = \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}^{n-1}) \quad (3.4)$$

A language model can be trained from a large collection of text such as books or journals.
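As an illustration of Eq. 3.4, the sketch below scores a sentence with a toy bigram model (m = 2). The counts, the vocabulary size and the smoothing constant are made-up assumptions purely for demonstration; a real language model would be estimated from a large text corpus.

```python
from collections import defaultdict

# Toy training counts (assumed); a real model is trained on a large corpus.
bigram_counts = defaultdict(int, {("<s>", "the"): 5, ("the", "cat"): 3, ("cat", "sat"): 2})
unigram_counts = defaultdict(int, {"<s>": 5, "the": 6, "cat": 3, "sat": 2})
vocab_size = 1000   # assumed vocabulary size
alpha = 0.1         # add-alpha smoothing constant (assumed)

def bigram_prob(word, prev):
    # p(w_n | w_{n-1}) with simple add-alpha smoothing
    return (bigram_counts[(prev, word)] + alpha) / (unigram_counts[prev] + alpha * vocab_size)

def sentence_prob(words):
    # p(w_1^N) = product over n of p(w_n | w_{n-1}), i.e. Eq. 3.4 with m = 2
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= bigram_prob(w, prev)
        prev = w
    return prob

print(sentence_prob(["the", "cat", "sat"]))
```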


    Global Search

The Global Search aims at finding the word sequence that maximizes the a-posteriori probability in Eq. 3.3. To reduce the number of hypotheses (i.e. word sequences) and speed up the search, the speech recognizer typically employs pruning strategies in the search algorithm. A widely used algorithm in speech recognizers is the Viterbi search [24], which employs a beam-search pruning strategy.

Two examples of automatic speech recognizers are Julius and Kaldi, which are widely used in both scientific and commercial applications.

    Figure 3.2 An overview of Julius [16].

    Julius

Julius [16] is an open-source, multi-platform, large-vocabulary speech recognizer written in C and aimed at both academic research and industrial application. Julius can be employed as a standalone automatic speech recognition system or as part of other applications. The input signal can be obtained either from audio files or from a live audio stream. Julius supports MFCC as the speech feature, N-gram models or rule-based grammars as the language model, and HMMs as the acoustic model.


    Kaldi

Kaldi [26] is another open-source speech recognizer, written in C++, which targets speech recognition research. The flexible structure of the toolkit allows researchers to easily integrate different tools of their own interest. Kaldi's feature extraction component (i.e. signal analysis) supports standard MFCC and PLP features for recognition. Kaldi accepts Gaussian Mixture Models and Subspace Gaussian Mixture Models [11] as the acoustic model. In addition, it can easily be extended to other kinds of models such as Deep Neural Networks [25]. The language model is required to be in a finite-state transducer (FST)-based format [1] (a conversion tool is provided along with the toolkit).


4 Gradient-Free Optimization

Gradient-free (derivative-free) optimization addresses the problem of minimizing (or maximizing) a scalar-valued objective function $L : \mathbb{R}^p \to \mathbb{R}$ over a domain, possibly with upper and lower bounds. Furthermore, these types of algorithms assume that the gradient of the objective function is unavailable and thus rely solely on the objective function's return value.

Indeed, this type of problem is very common, since for most real-life applications it is impractical to obtain the derivative information (it is either too costly or unavailable). The strong demand in this area has therefore led to a long history of research in which a number of approaches have been taken. Early days saw the works of Spendley et al. [34] and Nelder and Mead [23] with their simplex approach (its variation, namely Downhill Simplex, is among our interests). More recent works focus on the gradient-approximation approach, which tries to approximate the gradient while relying merely on function values. Some representatives of this type are Finite-difference Stochastic Approximation (FDSA) [14] and Simultaneous Perturbation Stochastic Approximation (SPSA) [32]. These two also belong to the group of non-deterministic, stochastic methods, which are preferable since they are better at handling unreliable objective function measurements (noisy measurements) and at escaping local minima. Another family of stochastic methods is spanned by the Simulated Annealing algorithm. However, since it was originally developed for discrete problems, Simulated Annealing is widely regarded as a less powerful algorithm for continuous problems. The final algorithm of interest is Evolution Strategies, particularly the state-of-the-art Covariance Matrix Adaptation extension, which is also a stochastic method.


Generally, these algorithms are referred to as unconstrained optimization algorithms. However, they can be adapted to constrained optimization problems by employing a penalty function. Constrained optimization finds the optimal point of an objective function with respect to a set of constraints on the parameters. Typically, by employing a penalty function, a certain value is added to the return value of the objective function whenever the constraints are violated. Since the constraints can be violated to some extent, some literature refers to this approach as soft-constrained. In contrast, hard-constrained optimization refers to the type of optimization where the constraints must be respected. The branch and bound algorithm is commonly used to solve this type of problem. Nonetheless, it is outside the scope of this thesis, as the soft-constrained approach is preferred to solve time-constrained ASR optimization.

In addition, a Gradient Descent-based method proposed by El Hannani and Hain [8], which is specially designed for speech recognition systems, is also covered in this Chapter. Its idea is actually similar to the gradient-approximation approach using perturbed parameter evaluations, but rather than finding an optimal point, the Gradient Descent method focuses on constructing an optimal curve providing the optimal Word Error Rate (WER) solution for any Real-Time Factor (RTF) level. It requires an initial point, which can be obtained from unconstrained optimization.

Finally, because this work is more about practical experiments than theoretical analysis, the discussion of each algorithm focuses on its description and covers the implementation concerns that were followed closely in the experiments. The mathematical aspects are out of the scope of this thesis; interested readers may refer to the cited papers in the Bibliography section.

Note that, for better readability, the same terms are deliberately used to describe the algorithms across this section and the document as a whole. One might find different terms in other literature. The definitions are used as follows:

$p$: the dimension of the search space, i.e. the number of input parameters of the objective function.

$\theta = (x_1, \ldots, x_p)$: a $p$-dimensional vector of the input parameters (i.e. the solution).

$\theta^*$: the optimal solution.

$\theta_0$: the guess solution or the initial point for any optimization algorithm.

$L(\theta)$: the objective (loss) function with input $\theta$.


In the following sections, we go through the details of SPSA, Simulated Annealing, Downhill Simplex, Evolution Strategies and Gradient Descent, in that order.

4.1 Simultaneous Perturbation Stochastic Approximation

Simultaneous Perturbation Stochastic Approximation (SPSA) is a gradient-free stochastic algorithm proposed by Spall [32] to minimize (or maximize) an objective function with respect to its input parameters. In contrast to gradient-based algorithms such as steepest descent or Newton-Raphson, which require direct gradient measurements, the gradient-approximation algorithm family only requires objective function measurements; the gradient value is approximated instead. One of the oldest approaches in this family is Finite-difference Stochastic Approximation (FDSA) [14], which employs the same recursive form (Eq. 4.1) as SPSA. However, its heavy iteration cost compared with SPSA makes it inferior, especially in high-dimensional spaces. In the same paper, Spall [32] argued that, although gradient-based algorithms theoretically have an advantage in convergence rate, practical problem characteristics, where gradient information is unavailable, usually make gradient-free algorithms like SPSA more appealing.

    Application

SPSA has a wide range of applications, including parameter optimization, fault detection, simulation-based optimization, neural network training, human-machine interaction and many others [31]. Some example systems are adaptive optics, aircraft modeling and control, atmospheric and planetary modeling, cardiological data analysis and circuit design. The full list of applications and references can be found in the same paper.

    Algorithm

The differentiable loss function to be optimized is denoted as $L(\theta)$, in which $\theta$ is a $p$-dimensional vector of parameters. SPSA adopts the classical approach of recursively finding $\theta$ such that $\partial L / \partial \theta = 0$. However, SPSA assumes that no direct measurement can be applied to find the gradient $g(\theta) = \partial L / \partial \theta$, so an approximate value is used instead.


Given a gain sequence $a_k$, the general recursive form of SPSA in the $k$-th iteration is:

$$\hat\theta_{k+1} = \hat\theta_k - a_k\, \hat g(\hat\theta_k) \quad (4.1)$$

To approximate the gradient, the current solution $\hat\theta_k$ is first perturbed by a random perturbation vector $\Delta_k$ multiplied by a positive scalar $c_k$ (Eq. 4.2 and Eq. 4.3). In this way, the perturbation vector affects all components of $\theta$ simultaneously (hence the name Simultaneous Perturbation), in contrast with the FDSA method, where each component of $\theta$ is perturbed independently. This idea greatly improves the speed of the algorithm in terms of iteration cost while maintaining a comparable accuracy level, as claimed by the author [32].

$$\theta_k^+ = \hat\theta_k + c_k \Delta_k \quad (4.2)$$

$$\theta_k^- = \hat\theta_k - c_k \Delta_k \quad (4.3)$$

Eventually, the gradient $\hat g(\hat\theta_k)$ in Eq. 4.1 can be approximated by the following formula:

$$\hat g_k(\hat\theta_k) = \begin{bmatrix} \dfrac{L(\theta_k^+) - L(\theta_k^-)}{2\, c_k\, \Delta_{k1}} \\ \vdots \\ \dfrac{L(\theta_k^+) - L(\theta_k^-)}{2\, c_k\, \Delta_{kp}} \end{bmatrix} \quad (4.4)$$

As one can see, the gradient estimation formula requires only two loss function measurements, $L(\theta_k^+)$ and $L(\theta_k^-)$, regardless of the problem dimension $p$. This feature makes SPSA especially efficient for high-dimensional problems. Furthermore, because of its stochastic character, SPSA is capable of escaping from local minima [20]; thus it has the properties of both a local and a global optimization method. The method also allows noisy function measurements. A further study by Spall [30] compared SPSA with other powerful stochastic methods such as Simulated Annealing, Evolution Strategies and Genetic Algorithms, and showed that SPSA presented competitive results with a lower overall number of loss function calls.


    Gain sequences

However, one difficulty in SPSA is how to choose appropriate gain sequences $a_k$ and $c_k$. The sequences are given by Eq. 4.5 and Eq. 4.6:

$$a_k = \frac{a}{(A+k)^{\alpha}} \quad (4.5)$$

$$c_k = \frac{c}{k^{\gamma}} \quad (4.6)$$

in which $k$ is the iteration number and $a$, $c$, $A$, $\alpha$, $\gamma$ are constants. A guide to choosing these coefficients can be found in [33]. $c$ is approximately the standard deviation of the noisy loss evaluation; if the objective function is noise-free, a small positive value is assigned to $c$. However, this way of setting $c$ is only effective if the solution is already near the desired optimum. In practice, it is often better to choose a considerably larger $c$ in order to have a better chance of escaping from a local optimum. $A$ is roughly less than 10 percent of the desired maximum number of iterations, and $a$ is chosen such that $a/(A+1)^{\alpha}$ multiplied by the magnitude of $\hat g(\hat\theta_0)$ approximates the desired change of $\theta$ in the early iterations. Finally, $\alpha$ and $\gamma$ should be less than 1.

One should note that SPSA is a memoryless algorithm: it theoretically does not measure and save the objective function value at any point in the process (only $L(\theta_k^+)$ and $L(\theta_k^-)$ are computed). However, in an experimental context, the objective value of the solution found at each iteration should be measured in order to keep track of the performance. This extra measurement in fact increases the load of the optimization work (3 instead of 2 function calls per iteration) but is not counted in the evaluation, since it contributes nothing to the optimization process itself. In a practical optimization context, only the final solution needs to be measured.
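The following Python sketch puts Eq. 4.1 to Eq. 4.6 together in a minimal SPSA loop. It is an illustration only: the gain constants, the iteration budget and the quadratic test objective are assumptions chosen for the example, not the settings used in the experiments of this thesis.

```python
import numpy as np

def spsa(loss, theta0, iterations=200, a=0.1, c=0.1, A=20, alpha=0.602, gamma=0.101):
    """Minimal SPSA minimizer following Eqs. 4.1-4.6 (illustrative defaults)."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, iterations + 1):
        a_k = a / (A + k) ** alpha          # Eq. 4.5
        c_k = c / k ** gamma                # Eq. 4.6
        delta = np.random.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +-1 perturbation
        theta_plus = theta + c_k * delta    # Eq. 4.2
        theta_minus = theta - c_k * delta   # Eq. 4.3
        # Eq. 4.4: only two loss evaluations, whatever the dimension p
        g_hat = (loss(theta_plus) - loss(theta_minus)) / (2.0 * c_k * delta)
        theta = theta - a_k * g_hat         # Eq. 4.1
    return theta

# Toy usage with a quadratic loss (an assumed stand-in for the expensive decoder run).
if __name__ == "__main__":
    quadratic = lambda t: float(np.sum((t - 3.0) ** 2))
    print(spsa(quadratic, theta0=np.zeros(4)))
```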

    4.2 Simulated Annealing

    The optimization algorithm Simulated Annealing (SA) [15], as the name sug-gests, is an analogy of physical annealing in metallurgy. The process of anneal-ing is used to refine the quality of a material. The material is initially heated inhigh temperature, allowing its atom to move around to search for lower energystate. The temperature is controlled carefully in order to stabilize the processand cool down the material. In the final state, the material should be removedof its defects.

    In fact, Simulated Annealing is original designed to address discrete problems.However, it has also been used to tackle continuous variable problems. The


    Figure 4.1 A Simulated Annealing example. Source: http://www.stanford.edu/~hwang41/

The experiments in Chapter 5 show that it is indeed useful for the free decoding parameter optimization problem.

One should note that the descriptions found in other materials often prefer terms such as the state and the energy of the system. These are equivalent to the solution and the objective/loss function value in this thesis, since the algorithm is described here in a continuous setting.

    Algorithm

Similarly, in computer science terms, the SA algorithm is used to refine the solution of a problem. The algorithm starts with a guess solution and an initial temperature. A candidate solution is generated by randomly picking a neighbour of the current solution; this is done by a neighbourhood function (Eq. 4.8). The candidate is always accepted if it is better than the current solution. On the other hand, a worse candidate can also be accepted with a probability $P$ depending on the temperature. This mechanism allows uphill moves for a more extensive search and is represented by an acceptance probability function (Eq. 4.7). Furthermore, the cooling schedule (Eq. 4.9) keeps decreasing the temperature and consequently the probability $P$. Eventually, the system stabilizes, since uphill moves occur less frequently in the later stages. At this point, the algorithm converges to a local (possibly, but not necessarily, global) optimum. In summary, the three components, namely the neighbourhood function, the acceptance probability function and the cooling schedule, constitute the Simulated Annealing algorithm. Figure 4.1 depicts an example run of Simulated Annealing to find the global maximum (instead of the minimum).


A jump from point Jump 1 to point Jump 2 helps the algorithm escape from a local maximum (note that point Jump 2 is not necessarily higher than point Jump 1).

An example of the acceptance probability is given in Eq. 4.7, in which $\Delta E$ is the difference in energy (i.e. in the value of the objective function), $k_b$ is the Boltzmann constant and $T$ is the current temperature.

$$P = \exp\left(\frac{-\Delta E}{k_b\, T}\right) \quad (4.7)$$

The neighbourhood function used in this work:

$$\theta' = \theta + N(0, \sigma) \quad (4.8)$$

The cooling schedule can be modelled as:

$$T_{k+1} = q\, T_k \quad (4.9)$$

where $0 < q < 1$.
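A minimal Python sketch of the three components above (neighbourhood function, acceptance probability and geometric cooling) is given below. The starting temperature, cooling factor, step size and the toy objective are illustrative assumptions; as is common in implementations, the physical constant $k_b$ is folded into the temperature.

```python
import math
import random

def simulated_annealing(loss, theta0, t0=10.0, q=0.95, sigma=0.5, iterations=2000):
    """Minimal continuous SA minimizer (illustrative parameter values)."""
    theta = list(theta0)
    best, best_loss = list(theta), loss(theta)
    current_loss, temperature = best_loss, t0
    for _ in range(iterations):
        # Neighbourhood function (Eq. 4.8): Gaussian move around the current solution.
        candidate = [x + random.gauss(0.0, sigma) for x in theta]
        candidate_loss = loss(candidate)
        delta_e = candidate_loss - current_loss
        # Acceptance probability (Eq. 4.7), with k_b absorbed into the temperature.
        if delta_e < 0 or random.random() < math.exp(-delta_e / temperature):
            theta, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = list(theta), current_loss
        temperature *= q                      # Cooling schedule (Eq. 4.9)
    return best, best_loss

# Toy usage with a 2-dimensional quadratic bowl (assumed objective).
print(simulated_annealing(lambda t: sum((x - 1.0) ** 2 for x in t), [5.0, -4.0]))
```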


    4.3 Downhill Simplex

The Downhill Simplex algorithm, also known as the Nelder-Mead method, was proposed by Nelder and Mead [23] to search for the optimum in a $p$-dimensional space by employing a simplex. It is one of the earliest works in the field of gradient-free optimization. The first simplex-based algorithm was actually developed by Spendley et al. [34] and only allows two types of transformation, reflection and shrinkage; as a result, the simplex can only change its size and position. Nelder and Mead later added two more transformations, namely expansion and contraction, which allow the simplex to change its shape. Due to its simplicity and small storage requirement, Downhill Simplex quickly became popular among the family of direct search methods.

Figure 4.2 An example of 3-point simplex transformations in a 2-dimensional space. Source: http://www.jakubkonka.com/2013/10/22/nelder-mead-simplex.html

    Algorithm

To find the lowest (optimal) point in the search space, a simplex is formed as a collection of $p+1$ vertices $\theta_i$. During the search process, the simplex keeps moving downhill by transforming itself (i.e. replacing its vertices) to adapt to the surrounding landscape. The candidate vertex is computed based on a moving vector $d$ and a transforming operation. The three standard transforming operations based on the moving vector are:

    Figure 4.3 Three transforming operations of Downhill Simplex method.

Reflection: obtain a new point by reflecting the worst (highest) point over the center of gravity of the remaining vertices.

Expansion: expand the reflection only if the reflected point is better than any point in the simplex. It basically shifts the reflected point further away from the gravity center.

Contraction: only if the reflection failed (i.e. the candidate point obtained from reflection is not better than any point in the current simplex), pick another point. The new point should lie between the worst point and the gravity center.

Since the algorithm is typically provided with only one initial point, it is necessary to span this point in order to construct the initial simplex. Spanning can be achieved by moving the initial point along each dimension.

At each iteration, a moving vector $d$ to guide the transformation is computed (Eq. 4.10):

$$d = \frac{1}{n} \left( \sum_{i=1}^{n+1} \theta_i - (n+1)\,\theta_{\text{worst}} \right) \quad (4.10)$$

where $\theta_{\text{worst}}$ is the worst (highest) point of the current simplex. The moving vector $d$ is basically the difference between the gravity center of the simplex excluding $\theta_{\text{worst}}$ and $\theta_{\text{worst}}$ itself.

    A summary of the downhill simplex version used in this thesis is as follows:

    1. Start with an initial point.


    2. Span the initial point to form a simplex.

    3. Repeat until stop conditions:

Calculate the moving vector $d$ by Eq. 4.10.

Try reflection:

$$\theta_r = \theta_{\text{worst}} + \alpha\, d \quad (4.11)$$

If $\theta_r$ is better than $\theta_{\text{best}}$, try expansion:

$$\theta_e = \theta_{\text{worst}} + \beta\, d \quad (4.12)$$

If $\theta_r$ is worse than $\theta_{\text{worst}}$, try contraction:

$$\theta_c = \theta_{\text{worst}} + \gamma\, d \quad (4.13)$$

If $\theta_r$, $\theta_e$ and $\theta_c$ are all worse than $\theta_{\text{worst}}$, try shrinkage (or reduction). This extra transformation shrinks the simplex by bringing all the simplex vertices closer to the best point:

$$\theta_i = \theta_{\text{best}} + \delta\,(\theta_i - \theta_{\text{best}}) \quad (4.14)$$

Otherwise, replace $\theta_{\text{worst}}$ with the best point among $\theta_r$, $\theta_e$, $\theta_c$.

    4. Check stop conditions.

a) Computing time exhausted (i.e. the number of loss function calls exceeds the maximum allowance).

b) The moving vector $d$ becomes insignificant and unable to make any difference for the simplex transformation.

The standard values for the coefficients are $\alpha = 2.0$, $\beta = 3.0$, $\gamma = 0.5$, $\delta = 0.5$. One may find different standard values in other literature; this is only due to the different formulae used.

As mentioned above, the algorithm requires a simplex construction step at the initial stage, which costs $p+1$ objective function evaluations; hence it suffers from a slow start in the case of a high-dimensional space, especially in situations where the cost of one function call is considerably expensive. After the initialization, each iteration costs either 1, 2, 3 or $p+3$ function evaluations.
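The step list above translates into the following Python sketch. The spanning offset, the coefficient values and the Rosenbrock test objective are illustrative assumptions; this is a simplified reading of the variant described here, not the exact implementation used in the experiments.

```python
import numpy as np

def downhill_simplex(loss, theta0, span=1.0, max_evals=500,
                     alpha=2.0, beta=3.0, gamma=0.5, delta=0.5, tol=1e-8):
    """Minimal Downhill Simplex following the step list above (illustrative settings)."""
    p = len(theta0)
    # Span the initial point along each dimension to build the p+1 vertices.
    simplex = [np.array(theta0, dtype=float)]
    for i in range(p):
        v = np.array(theta0, dtype=float)
        v[i] += span
        simplex.append(v)
    values = [loss(v) for v in simplex]
    evals = p + 1
    while evals < max_evals:
        worst, best = int(np.argmax(values)), int(np.argmin(values))
        # Moving vector d (Eq. 4.10): centroid of the other vertices minus the worst vertex.
        d = (np.sum(simplex, axis=0) - (p + 1) * simplex[worst]) / p
        if np.linalg.norm(d) < tol:
            break
        candidates = [simplex[worst] + alpha * d]          # reflection (Eq. 4.11)
        cand_vals = [loss(candidates[0])]; evals += 1
        if cand_vals[0] < values[best]:                    # expansion (Eq. 4.12)
            candidates.append(simplex[worst] + beta * d)
            cand_vals.append(loss(candidates[1])); evals += 1
        elif cand_vals[0] > values[worst]:                 # contraction (Eq. 4.13)
            candidates.append(simplex[worst] + gamma * d)
            cand_vals.append(loss(candidates[1])); evals += 1
        if min(cand_vals) < values[worst]:
            k = int(np.argmin(cand_vals))                  # keep the best candidate
            simplex[worst], values[worst] = candidates[k], cand_vals[k]
        else:                                              # shrinkage (Eq. 4.14)
            for i in range(p + 1):
                if i != best:
                    simplex[i] = simplex[best] + delta * (simplex[i] - simplex[best])
                    values[i] = loss(simplex[i]); evals += 1
    best = int(np.argmin(values))
    return simplex[best], values[best]

# Toy usage on the 2-dimensional Rosenbrock function (assumed objective).
rosen = lambda t: 100.0 * (t[1] - t[0] ** 2) ** 2 + (1.0 - t[0]) ** 2
print(downhill_simplex(rosen, [-2.0, -2.0]))
```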


    4.4 Evolution Strategies

The Evolution Strategies (ES) approach is a branch of the Evolutionary Algorithm family, next to Genetic Algorithms and Evolutionary Programming [7], and also belongs to the group of stochastic algorithms. ES was originally developed for parameter optimization tasks in the 1960s as an imitation of organic evolution. The algorithm was employed by Kacur and Korosi [13] for accuracy optimization of a speech recognition system. Some notions of Evolution Strategies, such as mutation, recombination, selection, fitness, individual, generation, parents and offspring, are borrowed from the process of natural selection.

    Terms

An individual is a solution of the problem. Fitness generally means how good an individual is and is task-specific. A group of individuals forms a generation. This group reproduces and is then called the parents. The reproduction process employs recombination and mutation to produce a number of individuals which are referred to as offspring. The selection process only keeps the fittest individuals for the next generation and discards the rest; in other words, it implements survival of the fittest.

An Evolution Strategy is often denoted as $(\mu/\rho, \lambda)$ or $(\mu/\rho + \lambda)$. Here $\mu$ is the parent population, $\rho$ is the number of individuals selected from the parent population for reproduction (i.e. recombination and mutation), and $\lambda$ is the offspring population generated from them. The comma-selection strategy $(\mu, \lambda)$ means the next generation is selected solely from the offspring set, while the plus-selection strategy $(\mu + \lambda)$ picks survivors from the union of $\mu$ and $\lambda$.

    Strategies

Some common strategies for recombination are intermediate and discrete [13]. For the intermediate strategy, an offspring is the mean of its parents. For the discrete strategy, an offspring is constructed by picking components from the parents at random. On the other hand, mutation can be achieved by adding a stochastic factor to a solution, on the assumption that the components of the solution are stochastically independent:

$$\theta' = \theta + N(0, \sigma) \quad (4.15)$$

in which $N(0, \sigma)$ is a Gaussian distribution and $\theta'$ is the mutation of the original vector $\theta$.
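The recombination and mutation strategies above can be combined into a minimal $(\mu/\rho, \lambda)$ Evolution Strategy, sketched below in Python. The population sizes, the mutation scale and the sphere test objective are illustrative assumptions; the actual ES experiments in this thesis rely on the CMA-ES extension described next.

```python
import random

def evolution_strategy(loss, theta0, mu=5, rho=2, lam=20, sigma=0.3, generations=100):
    """Minimal (mu/rho, lambda)-ES with intermediate recombination and Gaussian mutation."""
    p = len(theta0)
    # Initial parent population: mutated copies of the guess solution.
    parents = [[x + random.gauss(0.0, sigma) for x in theta0] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            chosen = random.sample(parents, rho)
            # Intermediate recombination: component-wise mean of the rho chosen parents.
            child = [sum(ind[i] for ind in chosen) / rho for i in range(p)]
            # Mutation (Eq. 4.15): add independent Gaussian noise to every component.
            child = [x + random.gauss(0.0, sigma) for x in child]
            offspring.append(child)
        # Comma selection: the next parents are the mu fittest offspring only.
        offspring.sort(key=loss)
        parents = offspring[:mu]
    return min(parents, key=loss)

# Toy usage on a sphere function (assumed objective).
print(evolution_strategy(lambda t: sum(x * x for x in t), [2.0, 2.0, 2.0]))
```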


    Covariance Matrix Adaptation

A more complex approach employs multivariate Gaussian distributions, taking into consideration the correlation between the components (in a parameter optimization problem, a component is typically a parameter). The offspring are then sampled from a multivariate normal distribution $N(m, \sigma^2 C)$:

$$\theta' \sim N(m, \sigma^2 C) \quad (4.16)$$

$$\theta' \sim m + \sigma\, N(0, C) \quad (4.17)$$

in which

$m$ is the mean vector of the previous generation,

$\sigma$ is the standard deviation or step-size,

$N(0, C)$ is a multivariate normal distribution with mean zero and covariance matrix $C$.

The covariance matrix $C$ and the mean $m$ can theoretically be estimated from the current population. However, this approach implies that at each generation the covariance matrix must be built from scratch, which is quite expensive in some cases. Therefore, an updating strategy is employed to adapt the matrix step by step, hence the name Covariance Matrix Adaptation Evolution Strategies (CMA-ES). The selection process for the next generation is similar to the one in the canonical version, i.e., only the fittest individuals are kept. Consequently, the mean $m$, the covariance matrix $C$ and the step-size $\sigma$ are adapted to fit the chosen population. In CMA-ES, the previous generation is usually disregarded, since the distribution already contains sufficient knowledge for the next generation (similar to the comma strategy). The details of the adaptation method are described thoroughly by Hansen [9]. For convenience, the CMA-ES related experiments in this thesis employed the CMA-ES tool [9] written by the same author.

According to Kacur and Korosi [13], experiments with an automatic speech recognition system showed promising results at low cost in comparison with manual adjustment. However, the method suffers from worse results in the early stages due to the dispersion of the population.

    4.5 Gradient Descent

El Hannani and Hain [8] proposed another method based on the Gradient Descent approach to find the optimal configuration by tracking an optimal performance curve.


This curve describes the optimum WER value as a function of the RTF value. The definitions of WER and RTF will be introduced in more detail in Chapter 5. Basically, WER is used to measure recognition accuracy while RTF is used to measure relative recognition speed. In contrast to the previous techniques, this one is designed only for the time-constrained automatic speech recognition optimization task.

Figure 4.4 Two optimal performance curves given by the Gradient Descent method [8].

    Algorithm

The optimal curve is given by Eq. 4.18:

$$C_{\text{opt}}(r) = \min_{\theta : R(\theta) = r} C(\theta) \quad (4.18)$$

where $C(\theta)$ is the cost function, $\theta$ is a vector of input parameters (the solution) and $R(\theta)$ is the RTF obtained with input $\theta$. In the original paper, the authors assume that the optimal curve is smooth and unique [8]. The solution of the equation is given by Eq. 4.19:

$$\theta_{\text{opt}}(r) = \arg\min_{\theta : R(\theta) = r} C(\theta) \quad (4.19)$$


To track the optimal curve, a starting point is computed first by applying any unconstrained search method (any of the above-mentioned algorithms). Note that a point on the optimal curve is described by its WER and RTF values, not by the solution itself. Indeed, under the assumption that the unconstrained search method can provide a point with the optimum WER, this point definitely lies on the optimal curve. In each iteration, a set of candidate solutions is generated by perturbing the previous optimal point's solution in order to estimate the gradient of the curve (Eq. 4.20). The perturbation is applied independently to each component of the previous solution; in other words, if the size of the solution is $p$, there are consequently $p$ candidates. The next solution is then chosen under the constraint of minimal cost (Eq. 4.21). As a result, the next point of the optimal curve is constructed. The process is repeated until no point with a lower RTF can be found.

$$k_i(t-1) = \frac{\theta_i(t-1) - \theta_{\text{opt}}(t-2)}{R(\theta_i(t-1)) - R(\theta_{\text{opt}}(t-2))} \quad (4.20)$$

where $\theta_i$ is candidate solution $i$, $k$ is the approximate gradient value and $t$ is the current iteration number.

$$C_{\text{minWER}}(t) = \text{WER}\big(\theta_{\text{opt}}(t-1) - \epsilon\, k(t-1)\big) \quad (4.21)$$

where $\epsilon$ is an exponential decay function $\epsilon(t) = \epsilon(0)\, e^{-t}$.

The original paper also mentions that, given a parameter vector $\theta$, the RTF value is always distorted by an additive noise factor, unlike the reproducible WER value. This is expected, because the decoding speed depends heavily on the current state of the computer. For this reason, the optimal curve can only be obtained by interpolation.

Despite providing more information on the optimization task by presenting an optimal curve, this method is in fact very time-consuming in comparison with the other methods. Its single-component perturbation approach requires two function evaluations for each component (similar to FDSA), hence a cost of $2p$ function calls for each iteration in a $p$-dimensional search space. One also needs to take into account the cost of computing the starting point, since it is also expensive. On the other hand, the optimal curve contains a great amount of information, as it can provide multiple optimal solutions with different decoding speeds. The method also has a generic nature, so it can theoretically be applied to any automatic speech recognition system.
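A rough Python sketch of this curve-tracking loop is shown below. The functions `wer_of` and `rtf_of` stand for full decoder runs and are assumptions, as are the perturbation size, the decay schedule and the stopping rule; the sketch follows Eqs. 4.20 and 4.21 only loosely (the gradient is estimated relative to the current optimal point) and is not the implementation evaluated in this thesis.

```python
import math

def track_optimal_curve(wer_of, rtf_of, theta_start, perturb=0.1,
                        eps0=1.0, decay=0.2, max_iterations=30):
    """Loose sketch of optimal-curve tracking (Eqs. 4.20-4.21); illustrative settings."""
    theta_opt = list(theta_start)
    curve = [(rtf_of(theta_opt), wer_of(theta_opt))]        # points are (RTF, WER) pairs
    for t in range(1, max_iterations + 1):
        base_rtf = curve[-1][0]
        # Perturb each component of the current optimal solution independently and
        # estimate the curve gradient component-wise (simplified form of Eq. 4.20).
        k = []
        for i in range(len(theta_opt)):
            cand = list(theta_opt)
            cand[i] += perturb
            d_rtf = rtf_of(cand) - base_rtf
            k.append(perturb / d_rtf if abs(d_rtf) > 1e-9 else 0.0)
        eps = eps0 * math.exp(-decay * t)                    # exponentially decaying step
        candidate = [x - eps * g for x, g in zip(theta_opt, k)]   # step as in Eq. 4.21
        new_rtf, new_wer = rtf_of(candidate), wer_of(candidate)
        if new_rtf >= base_rtf:
            break                                            # stop: no lower RTF found
        theta_opt = candidate
        curve.append((new_rtf, new_wer))
    return curve
```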


    4.6 Summary

This chapter provided a review of the five gradient-free optimization techniques investigated in this thesis, namely Simultaneous Perturbation Stochastic Approximation, Simulated Annealing, Downhill Simplex, Evolution Strategies and Gradient Descent. The last one is specifically designed to optimize speech recognition systems. The review focused on the implementation concerns of the algorithms in order to prepare for the experiments in the next Chapter.


5 Evaluation

This chapter provides descriptions and results of the experiments conducted with the optimization algorithms described in the previous chapter. It aims to directly compare the algorithms on different tasks under two criteria: the quality of the solution and the convergence speed. The experiments are divided into two groups: the first group consists of tests with standard functions such as Ackley, Rosenbrock and Dixon-Price, while the second group consists of tests with objective functions designed to optimize automatic speech recognition performance.

In this chapter, the definitions of Word Error Rate (WER) and Real Time Factor (RTF) are introduced as the measurement metrics for accuracy and speed, respectively, of automatic speech recognition. The objective function measuring the Word Error Rate can be found in the unconstrained optimization section. Objective functions employing a time constraint are described and tested in the constrained optimization section. For the constrained case, an extra value is added to the original return value of the unconstrained objective function (the Word Error Rate) to serve as a penalty whenever the Real Time Factor exceeds a certain threshold. Consequently, different penalty approaches can lead to different results.

This chapter also describes the experimental set-up, i.e., the speech recognizers used in the experiments. Finally, the results are presented along with their evaluation.


Function      lower bound   upper bound   θ*                          L(θ*)   θ0
Ackley        -32.768       32.768        (0, ..., 0)                 0       (15.0, ..., 15.0)
Dixon-Price   -10.0         10.0          (x_i = 2^{-(2^i-2)/2^i})    0       (5.0, ..., 5.0)
Rosenbrock    -5.0          10.0          (1, ..., 1)                 0       (-2.0, ..., -2.0)
Schwefel      -500.0        500.0         (1, ..., 1)                 0       (50.0, ..., 50.0)
Griewank      -600.0        600.0         (0, ..., 0)                 0       (50.0, ..., 50.0)
Sphere        -5.12         5.12          (0, ..., 0)                 0       (2.0, ..., 2.0)
Levy          -10.0         10.0          (1, ..., 1)                 0       (9.0, ..., 9.0)
Zakharov      -5.0          10.0          (0, ..., 0)                 0       (2.0, ..., 2.0)

Table 5.1 Lower bound, upper bound, optimal point θ* and initial point θ0 of the test functions.

    5.1 Benchmark

This section serves as a preliminary test of the optimization algorithms. Various test functions are employed: multi-modal functions including Ackley (Eq. A.1), Schwefel (Eq. A.4), Griewank (Eq. A.5) and Levy (Eq. A.7); bowl-shaped functions including Sphere (Eq. A.6); valley-shaped functions including Dixon-Price (Eq. A.2) and Rosenbrock (Eq. A.3); and plate-shaped functions including Zakharov (Eq. A.8). All of the functions are extensible, i.e., one can easily change the number of input parameters in order to obtain similar functions in a higher or lower dimensional space. This property is necessary for testing performance in high-dimensional spaces, which is common in practical problems. The bounds, optimal points and initial points of the test functions can be seen in Table 5.1; plots and equations of the functions can be found in the Appendix.

Once again, the tested algorithms are Simultaneous Perturbation Stochastic Approximation, Simulated Annealing, CMA-ES and Downhill Simplex, as discussed in the previous chapter. Since the Gradient Descent approach is specialized for automatic speech recognition optimization, it is excluded from this section. In fact, several comparison works have been done before, as pointed out in Chapter 2, with even greater detail and at a larger scale. Nonetheless, the following information is still essential, since one might need to know how the implementations of the algorithms used in this thesis perform in a standard optimization test.

Further tests employ noise in the objective function. The noisy objective function is obtained by adding a noise factor as in Eq. 5.1:

$$L_{\text{noise}}(\theta) = L(\theta) + N(0, \sigma) \quad (5.1)$$
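As an illustration of Eq. 5.1, the snippet below wraps a benchmark function (here the Sphere function) with additive Gaussian noise; the noise level is an assumed example value, not the one used in the experiments of this thesis.

```python
import random

def sphere(theta):
    # Bowl-shaped benchmark: global minimum 0 at the origin.
    return sum(x * x for x in theta)

def noisy(loss, sigma=0.1):
    # Eq. 5.1: return a noisy version of the objective function.
    return lambda theta: loss(theta) + random.gauss(0.0, sigma)

noisy_sphere = noisy(sphere, sigma=0.1)
print(sphere([2.0, 2.0]), noisy_sphere([2.0, 2.0]))
```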


Function      p    L(θ0)         SPSA     DS      SA      CMA-ES
Ackley        2    3.63          0.00     0.00    0.01    0.00
Ackley        10   3.63          0.00     0.00    0.02    0.00
Ackley        50   3.63          3.59     0.05    0.12    0.00
Dixon-Price   2    2 041.00      0.01     0.00    0.00    0.00
Dixon-Price   10   91 141.00     5.77     0.51    0.00    0.50
Rosenbrock    2    3 609.00      2.23     0.00    0.17    0.00
Rosenbrock    10   32 481.00     10.48    9.61    0.85    0.00
Schwefel      2    767.08        710.70   710.70  0.02    118.44
Griewank      2    2.92          1.34     1.34    0.01    0.04
Griewank      10   7.25          3.95     6.28    0.06    0.00
Sphere        2    8.00          0.00     0.00    0.00    0.00
Sphere        50   200.00        0.00     0.16    0.00    0.00
Sphere        100  200.00        0.00     24.25   0.65    0.00
Levy          2    36.32         5.76     0.00    0.00    0.00
Zakharov      2    98.00         0.00     0.00    0.00    0.00
Zakharov      10   9 153 690.00  32.15    33.66   0.17    0.00

Table 5.2 Optimization results with noise-free measurement.

For each function, the experiments are conducted with different problem sizes (i.e. the dimension of the solution). The initial points are set carefully to avoid the valley where the global minimum resides, with the intention of trapping the search process in a local minimum in the case of the multi-modal functions.

Table 5.2 presents the optimization results with noise-free function measurements. Generally, all four algorithms perform better in lower dimensional spaces than in higher dimensional spaces. They are able to find the optimal or a near-optimal point in most of the test cases. Although it was mentioned before that Downhill Simplex usually suffers from a slow initialization, this issue cannot be seen here because the dimension of the problem is insignificant in comparison with the total number of objective function calls. The CMA-ES results are at the top of the table with successful optimization in most of the tests, whereas the remaining three algorithms have relatively comparable performance. However, the Downhill Simplex results show that it is the approach most likely to be trapped in a local minimum, while Simulated Annealing has the best chance to escape, notably as the only successful algorithm on the Schwefel test. This advantage arguably comes from the searching strategy of Simulated Annealing, in which the neighbourhood size is fixed. The Downhill Simplex and SPSA results show that their solution qualities are more sensitive to an increasing problem size.

The next set of experiments employs a noise factor in the objective functions (Table 5.3).


Function      p    L(θ0)         SPSA     DS      SA      CMA-ES
Ackley        2    3.63          0.26     0.00    0.28    0.13
Ackley        10   3.63          1.82     0.00    6.70    0.53
Ackley        50   3.63          20.41    0.02    20.73   19.00
Dixon-Price   2    2 041.00      0.24     0.98    0.13    0.33
Dixon-Price   10   91 141.00     4.79     0.99    2.55    0.77
Rosenbrock    2    3 609.00      1.13     0.04    0.15    0.54
Rosenbrock    10   32 481.00     58.61    8.99    65.44   9.20
Schwefel      2    767.08        710.70   710.71  0.25    238.69
Griewank      2    2.92          1.34     0.00    0.17    0.06
Griewank      10   7.25          4.42     0.00    1.85    1.41
Sphere        2    8.00          0.00     0.00    0.05    0.05
Sphere        50   200.00        0.00     0.00    2.55    0.31
Sphere        100  200.00        0.00     0.00    8.34    0.81
Levy          2    36.32         5.77     0.73    3.61    0.01
Zakharov      2    98.00         0.01     0.00    0.18    0.17
Zakharov      10   9 153 690.00  28.93    0.00    2.70    0.20

Table 5.3 Optimization results with noisy measurement.

Generally, it can be said that all the algorithms handle the noise successfully, since they maintain the quality level of their solutions for the majority of the test cases. Downhill Simplex is the most resilient in this respect, since it offers even better solutions when compared with the noise-free experiments. One can speculate that the noise becomes an additional stochastic factor that helps Downhill Simplex avoid being stuck in a local minimum.

    5.2 Automatic Speech Recognition Task

Parameter optimization for automatic speech recognition aims to improve the performance of such a system by finding more suitable parameter values. The tested algorithms include SPSA, Simulated Annealing, Downhill Simplex, CMA-ES (for both unconstrained and time-constrained experiments) and Gradient Descent (only for time-constrained experiments).

The performance metrics and the objective function are described next, followed by the experimental set-up information regarding the speech recognition system configuration and the decoding paradigms. The experiments are then organized according to the optimization goal: the first group considers unconstrained tests while the second group considers time-constrained tests.


    5.2.1 Performance Metric

The performance of a speech recognizer can be expressed in terms of accuracy and speed. For accuracy measurement, the Word Error Rate is widely used in the machine translation and speech recognition fields. The WER metric measures the difference between a hypothesis (the decoding output) and a reference (the audio transcript) as follows.

$\mathrm{WER} = \frac{S + D + I}{N}$    (5.2)

in which

S is the number of substitutions,
D is the number of deletions,
I is the number of insertions, and
N is the length of the reference (the number of words in the transcript file).

S + D + I is the Levenshtein distance [17] between the two sequences (the hypothesis and the reference), i.e., the minimum number of edit operations (substitutions, deletions and insertions) required to transform the hypothesis into the reference.
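For illustration, the sketch below computes the WER of a hypothesis against a reference via a word-level Levenshtein distance; it is a minimal reference implementation, not the scoring tool used in the thesis experiments, and the function name is only a placeholder.

    def word_error_rate(hypothesis, reference):
        """WER = (S + D + I) / N, computed via a word-level Levenshtein distance."""
        hyp = hypothesis.split()
        ref = reference.split()
        # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                       # i reference words missing -> deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                       # j extra hypothesis words -> insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # match / substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat", "the cat sat on the mat"))  # 3 errors / 6 words = 0.5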

On the other hand, decoding speed is measured by the Real Time Factor (RTF), which is the ratio between processing time and input duration (Eq. 5.3).

$\mathrm{RTF} = \frac{\text{Processing Time}}{\text{Input Duration}}$    (5.3)

    5.2.2 Objective Function

Based on the work in [35], this thesis defines the general form of the objective function for measuring speech recognizer performance in Eq. 5.4.

$L(\theta) = \mathrm{WER}(\theta) + k \cdot \max(0,\, \mathrm{RTF}(\theta) - t)$    (5.4)

where WER(θ) and RTF(θ) are the WER and RTF measurements of the recognition output with parameter set θ. This is a soft-constraint approach in which the objective function is linearly penalized when the RTF value violates the constraint. k is the penalty weight (which should be positive) and t is the threshold. The threshold value is defined based on the desired target of the system, taking the standard performance (i.e., with the default parameter configuration) into account. For example, if the system is required to decode in real time, the threshold value should be less than 1. If the computed RTF value is lower than the threshold t, there is no penalty. In the unconstrained tests the penalty weight k is set to 0, so slow decoding is not punished. For constrained optimization, k should be strictly positive; Section 5.2.5 discusses various options and experiments for setting the value of k.
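A minimal sketch of the penalized objective in Eq. 5.4 is given below; decode_and_score is a hypothetical helper that runs the decoder with parameter set theta and returns the measured WER and RTF, it does not correspond to an actual interface of Julius or Kaldi.

    def penalized_objective(theta, k, t, decode_and_score):
        """Soft-constrained objective L(theta) = WER(theta) + k * max(0, RTF(theta) - t).

        k = 0 reproduces the unconstrained objective of Eq. 5.5,
        k > 0 linearly penalizes decoding runs slower than the RTF threshold t.
        """
        wer, rtf = decode_and_score(theta)
        return wer + k * max(0.0, rtf - t)

    # Example with a dummy scorer: WER 29.6 at RTF 4.5, target t = 3.0, k = 10.
    dummy = lambda theta: (29.6, 4.5)
    print(penalized_objective(theta=None, k=10.0, t=3.0, decode_and_score=dummy))  # 29.6 + 10 * 1.5 = 44.6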

    5.2.3 Experimental Set-up

The experiments in this thesis are mainly conducted on the Julius decoder [16], which employs Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs).

Further experiments are done on the Kaldi decoder [26], which employs Deep Neural Networks (DNNs) [11] and Subspace Gaussian Mixture Models (SGMMs) [25]. Both techniques have drawn considerable research attention in the recent literature as they outperform GMM-HMM driven techniques in most international benchmarks.

Julius parameter information is provided in Table 5.4. There are ten parameters in total (Julius requires a pair of language model weight parameters, a pair of insertion penalties, and a pair of beam width parameters, as its approach consists of a two-pass decoding technique) with start values, minimum values and maximum values (i.e., lower and upper bounds). Similarly, Table 5.5 provides the Kaldi parameter information. The start values are used as the initial (i.e., guess) solution and serve as the baseline for all experiments.

The acoustic training data consists of roughly 323 hours of German broadcast and talk shows. The test sets, one for development and four for evaluation, are acquired from the DiSCo [4] and LinkedTV [29] projects.

name              start      min      max
(2) LM weight     10.0/10.0  0.0      20.0
(2) ins. penalty  -7.0/10.0  -20.0    20.0
(2) beam width    1 500/250  700/20   3 000/1 000
score envelope    80.0       50.0     150.0
stack size        10 000     500      20 000
#expanded hyp.    20 000     2 000    20 000
#sentence hyp.    10         5        1 000

Table 5.4  Free parameters of the decoding process in Julius. Some parameters are given individually to the 1st pass or 2nd pass of the Julius decoder and are marked with (2). Continuous parameters are marked by a trailing .0.


name                   start    min    max
decoding beam          11.0     1.0    20.0
lattice beam           8.0/6.0  1.0    20.0
acoustic scaling       0.1      0.05   0.2
maximum active states  7 000    1 000  15 000
speaker vector beam    4.0      1.0    10.0

Table 5.5  Free parameters of the decoding process in Kaldi. Continuous parameters are marked by a trailing .0. The speaker vector beam is exclusive to the SGMM decoder. The lattice beam defaults to 8.0 for the DNN decoder and 6.0 for the SGMM decoder.
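One possible way to represent such bounded free parameters during optimization is sketched below; the bound values are taken from a subset of Table 5.4, but the data structure and the clipping helper are only an assumed illustration, not the representation actually used in the thesis.

    # (name, start, lower bound, upper bound) for a subset of the Julius parameters.
    JULIUS_PARAMS = [
        ("lm_weight_pass1",  10.0,    0.0,    20.0),
        ("lm_weight_pass2",  10.0,    0.0,    20.0),
        ("score_envelope",   80.0,   50.0,   150.0),
        ("stack_size",    10000.0,  500.0, 20000.0),
    ]

    def clip_to_bounds(theta):
        """Project a candidate solution back into the feasible box before decoding."""
        return [min(max(value, lo), hi)
                for value, (_, _, lo, hi) in zip(theta, JULIUS_PARAMS)]

    print(clip_to_bounds([25.0, -3.0, 80.0, 100.0]))  # [20.0, 0.0, 80.0, 500.0]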

    5.2.4 Unconstrained Optimization

The aim of the following optimization experiments is to optimize the Word Error Rate regardless of the decoding time, thus the penalty weight k in Eq. 5.4 is set to zero. Consequently, the penalty term vanishes and the objective function simply measures the Word Error Rate of the decoding output:

$L(\theta) = \mathrm{WER}(\theta)$    (5.5)

    Evaluation

Table 5.6 presents the results of unconstrained optimization of the Julius parameters. With the exception of the baseline output, each algorithm's results are presented in two rows: the first row presents the solution provided in an early stage of the optimization process, i.e., after 21 objective function evaluations, while the second row presents the best solution the algorithm can find and the cost spent to reach that solution. Cost is measured in terms of objective function calls (Julius calls). In general, the speech recognizer performance in terms of WER improves considerably in all cases. Simulated Annealing provided the best solution on the development set with the lowest WER value of 27.0, while the other methods produced less remarkable but still comparable results. However, it takes 129 function evaluations for the Simulated Annealing method to reach this solution, which is quite costly (a Julius call on the development set takes roughly 1.5 hours on average). In contrast, SPSA reaches its final solution after only 64 function evaluations. The first row of each algorithm's results also reveals that SPSA can quickly improve the speech recognition performance in terms of accuracy: after just 21 function evaluations, the solution found by the SPSA method already achieves a WER value of 27.8, while the second-best output (28.2) comes from the CMA-ES method. The WER enhancements of the remaining two methods at this point are insignificant, with a decrease of 0.1 and 0.3 by Downhill Simplex and Simulated Annealing, respectively. The last four columns of the table show that the optimized parameter sets maintain their effectiveness on the evaluation sets.

GMM-HMM (Julius)       #eval  RTF   WER   WER      WER     WER       WER
                              Dev.  Dev.  DiSCo    DiSCo   LinkedTV  LinkedTV
                                          planned  spont.  planned   spont.
+ baseline             1      4.5   29.6  24.8     31.3    26.5      49.8
+ Downhill Simplex     21     4.1   29.5  24.7     31.2    26.5      49.8
+ Downhill Simplex     95     5.2   27.4  22.2     27.5    24.2      45.0
+ SPSA                 21     6.3   27.8  22.8     28.7    24.8      46.8
+ SPSA                 64     6.0   27.5  22.3     27.8    24.4      45.5
+ CMA-ES               21     5.6   28.2  23.1     28.8    25.2      46.9
+ CMA-ES               111    5.5   27.4  21.9     27.0    24.1      44.5
+ Simulated Annealing  21     3.6   29.3  24.2     30.0    26.3      48.7
+ Simulated Annealing  129    5.8   27.0  21.7     26.9    24.0      44.3

Table 5.6  WER [%] results of ASR system configuration on various corpora.

Figure 5.1 provides more details about the WER development in the early stages. It strengthens the argument that SPSA is the fastest method to converge, as a dramatic drop in WER can already be observed in the first iteration. A similar WER development occurs in the case of CMA-ES, although the high iteration cost means that the first improvement can only be seen after 10 Julius calls. In contrast, the first visible WER decrease in the case of Downhill Simplex comes after nearly 40 evaluation calls. This slow progression is expected since the Downhill Simplex method suffers from the simplex initialization cost.

In the next set of experiments, SPSA is applied to the Kaldi decoder with two different decoding paradigms, SGMMs and DNNs. Although the WER enhancement is less noticeable, the results in Table 5.7 agree with the previous experiments, as the WER values decrease for every test set.

                          #eval  WER   WER      WER     WER       WER
                                 Dev.  DiSCo    DiSCo   LinkedTV  LinkedTV
                                       planned  spont.  planned   spont.
GMM-HMM (Julius)          1      29.6  24.8     31.3    26.5      49.8
GMM-HMM (Julius) + SPSA   40     27.7  22.6     28.4    24.5      45.6
DNN (Kaldi)               1      23.9  18.4     22.6    21.0      37.8
DNN (Kaldi) + SPSA        44     23.8  18.2     22.4    20.7      37.7
SGMM (Kaldi)              1      23.5  18.1     22.5    20.8      36.6
SGMM (Kaldi) + SPSA       34     23.0  17.6     22.0    20.5      36.4

Table 5.7  WER [%] results of different ASR paradigms with standard setting and SPSA adapted parameters.


[Figure: WER (%) over the number of Julius calls (#eval); curves for Downhill Simplex, SPSA, Evolution Strategies and Simulated Annealing.]

Figure 5.1  Comparison between Downhill Simplex, SPSA, Evolution Strategies and Simulated Annealing after 41 Julius calls (#eval). Each dot represents one iteration.

    5.2.5 Time-constrained Optimization

The experiment results from the previous subsection reveal that the WER improvement comes as a result of longer processing time (the increase of the RTF values), as shown in Table 5.6. In practice, it is not uncommon for a speech recognition application to focus more on speed than on accuracy. Therefore, the time-constrained optimization experiments conducted in this subsection aim to solve that problem with a penalized objective function. The coefficient k in Eq. 5.4 is now assigned a strictly positive value, which can roughly be understood as how strict the constraint is. Again, the time-constrained objective function is

$L(\theta) = \mathrm{WER}(\theta) + k \cdot \max(0,\, \mathrm{RTF}(\theta) - t)$    (5.6)

which is identical to Eq. 5.4 except that k must be strictly greater than 0.


    Approach

To control the value of k, the experiments employed three different approaches from which three different objective functions are derived, namely delta, increasing [35] and adaptive. The first one employs a fixed k, while the other two adjust the coefficient in every iteration. However, the k adjustment methods are only suitable for SPSA, since the number of function evaluations in each iteration of an SPSA optimization run is constant (i.e., always 2) for any problem. The cost of each iteration of the Downhill Simplex process depends on the success of the transformation operations; for the Simulated Annealing method, it relies on the maximum number of rejected (or accepted) solutions allowed at any temperature; and the iteration cost of the CMA-ES algorithm is determined by the offspring population size. This variation of the iteration cost makes it difficult to control the behaviour of k in these methods, thus only the delta objective function is used with the latter three algorithms.

The details of the increasing and adaptive methods are as follows (a code sketch of both schedules is given below).

Increasing: At the beginning, set k = 1. Until the RTF reaches the threshold t, set k = k + 1 at each iteration. Once the RTF reaches the threshold t, fix k at its current value.

Adaptive: At the beginning, set k = 1. If RTF > t, set k = k + 1; otherwise, if k > 1, set k = k - 1.
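The following sketch summarizes both update rules, applied once per SPSA iteration; it is a minimal illustration of the schedules described above, not the actual implementation used in the thesis, and the function name is a placeholder.

    def update_penalty_weight(k, rtf, t, mode, target_reached=False):
        """Return the penalty weight k for the next iteration.

        mode 'increasing': grow k by 1 each iteration; once the RTF target t has
        been reached for the first time (target_reached=True), keep k fixed.
        mode 'adaptive': grow k by 1 while the RTF violates the target, shrink it
        (but never below 1) once the target is met.
        """
        if mode == "increasing":
            return k if target_reached else k + 1
        if mode == "adaptive":
            if rtf > t:
                return k + 1
            return k - 1 if k > 1 else k
        raise ValueError("unknown mode: " + mode)

    # Example: adaptive schedule reacting to a sequence of measured RTF values.
    k = 1
    for rtf in [4.5, 3.8, 3.2, 2.9, 2.8]:
        k = update_penalty_weight(k, rtf, t=3.0, mode="adaptive")
        print(rtf, k)   # k rises to 4 while RTF > 3.0, then decreases again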

The last method to be considered for time-constrained parameter optimization is Gradient Descent, which was described in Section 4.5. As stated there, the method requires an initial point, which can be found by an unconstrained search. In the following experiments, the solution found by the unconstrained SPSA experiment is employed as the initial point for the Gradient Descent optimization.

    Evaluation

The time-constrained experiment results on Julius are presented in Table 5.8 and Table 5.9. The first table contains the results of the delta RTF penalty (fixed k) approach combined with all algorithms (i.e., SPSA, Downhill Simplex, Simulated Annealing, Evolution Strategies), while the latter presents the k adjustment methods combined with SPSA (i.e., SPSA increasing and SPSA adaptive). In other words, the experiment set in Table 5.8 employed the same delta objective function with k = 10. Table 5.9 also includes the solution with the lowest RTF found by the Gradient Descent method (with the unconstrained SPSA solution as the initial point). It can easily be seen that all approaches except SPSA + Gradient Descent successfully reach the RTF target of 3.0 on the development set. Note that the Gradient Descent method contains no coefficient that implies the RTF threshold; the result in the table simply means that it cannot find any solution with an RTF value better than 3.2. Nonetheless, these improvements in RTF come at the price of lower accuracy (an increase of WER), with the exception of Gradient Descent, SPSA increasing and CMA-ES delta. In particular, the latter two approaches succeed in lowering the RTF to the target value while slightly enhancing the accuracy at the same time. On the other hand, the Downhill Simplex delta result shows a jump of the WER value from 29.6 to 36.8, implying that this approach is not suitable for the time-constrained task, arguably due to the noisy measurement. The improvements in decoding speed are also reflected on all evaluation sets, as seen in every row of the tables.

Julius           #eval  WER (RTF)   WER (RTF)   WER (RTF)   WER (RTF)   WER (RTF)
                        Dev.        DiSCo       DiSCo       LinkedTV    LinkedTV
                                    planned     spont.      planned     spont.
+ baseline       1      29.6 (4.5)  24.8 (4.3)  31.3 (4.3)  26.5 (4.8)  49.8 (5.6)
+ DS delta       94     36.8 (3.0)  31.9 (3.5)  40.4 (3.0)  34.4 (3.5)  59.1 (3.8)
+ SPSA delta     100    31.2 (2.8)  25.8 (2.4)  32.0 (3.1)  28.7 (2.7)  49.9 (2.8)
+ CMA-ES delta   151    29.3 (2.9)  23.9 (2.7)  29.0 (3.0)  26.1 (3.0)  47.1 (3.5)
+ SA delta       89     30.7 (2.9)  25.0 (2.8)  30.6 (3.7)  28.0 (3.0)  48.5 (3.1)

Table 5.8  Optimization results with RTF target t = 3.0 and k = 10 on the Julius decoder.

Julius              #eval  WER (RTF)   WER (RTF)   WER (RTF)   WER (RTF)   WER (RTF)
                           Dev.        DiSCo       DiSCo       LinkedTV    LinkedTV
                                       planned     spont.      planned     spont.
+ baseline          1      29.6 (4.5)  24.8 (4.3)  31.3 (4.3)  26.5 (4.8)  49.8 (5.6)
+ SPSA increasing   94     29.5 (2.9)  23.7 (2.7)  29.4 (3.1)  26.1 (3.0)  47.6 (3.5)
+ SPSA adaptive     98     29.7 (2.9)  24.3 (3.1)  30.0 (3.2)  26.9 (2.9)  48.0 (3.7)
+ SPSA + GD         60     29.2 (3.2)  23.4 (3.3)  29.1 (3.6)  25.6 (3.6)  47.2 (4.1)

Table 5.9  Optimization results with RTF target t = 3.0 on the Julius decoder.

Figures 5.2(a), 5.2(b), 5.3(a) and 5.3(b) show the WER and RTF developments of the successful optimizations in Table 5.8 and Table 5.9. Only the Downhill Simplex delta and SPSA + Gradient Descent results are excluded, since their respective final solutions are inferior in terms of quality. In the figures, the yellow lines present the behaviour of the RTF progression. It can easily be seen that the SPSA-based methods reach the RTF threshold (the light blue area) faster than the CMA-ES delta approach. It takes only 22 function evaluations for the RTF value to drop to 3.0 in Figure 5.3(a), and a similar behaviour can be observed for the other SPSA-based approaches, i.e., SPSA adaptive and SPSA delta. The CMA-ES delta method, on the other hand, requires roughly triple the cost, approximately 60 function evaluations, to achieve the same RTF target. This result implies that the SPSA-based methods indeed hold an advantage when taking convergence speed into consideration.

The SPSA adaptive and SPSA increasing experiments are then repeated in Table 5.10 with a more powerful system. Their performance details can be observed in Figures 5.4(a) and 5.4(b). In this case, the improvements can easily be seen in both the WER and RTF developments. Nonetheless, the SPSA increasing method still yields the better solution in terms of both speed and accuracy.

GMM-HMM (Julius)  #eval  RTF  WER   WER      WER     WER       WER
                              Dev.  DiSCo    DiSCo   LinkedTV  LinkedTV
                                    planned  spont.  planned   spont.
+ baseline        1      4.2  29.6  24.8     31.3    26.5      49.8
+ SPSA inc        86     2.8  28.9  23.1     28.5    25.2      46.5
+ SPSA adap       86     3.0  29.1  23.0     28.1    25.6      46.3

Table 5.10  WER [%] results of Julius with standard setting and SPSA adapted parameters.

To test the efficiency of the SPSA increasing method on different decoding paradigms, the time-constrained experiments were repeated on the Kaldi decoder with its two different implementations (i.e., DNNs and SGMMs) in Table 5.11. The threshold t is set to 0.8 and 1.6, respectively. In both cases the SPSA increasing method successfully reduces the RTF to the respective target. However, the method fails to avoid a degradation in accuracy: on the development set, the WER value rises from 23.9 to 24.2 for the DNN implementation and from 23.5 to 24.3 for the SGMM implementation. The speech recognizer accuracy also degrades accordingly on the evaluation sets.

Kaldi               #eval  RTF  WER   WER      WER     WER       WER
                                Dev.  DiSCo    DiSCo   LinkedTV  LinkedTV
                                      planned  spont.  planned   spont.
+ DNN + baseline    1      1.1  23.9  18.4     22.6    21.0      37.8
+ DNN + SPSA inc    16     0.8  24.2  18.7     22.8    21.2      38.1
+ SGMM + baseline   1      2.5  23.5  18.1     22.5    20.8      36.6
+ SGMM + SPSA inc   28     1.6  24.3  18.9     23.3    21.2      38.0

Table 5.11  WER [%] results of Kaldi with standard setting and SPSA adapted parameters.


[Figure, two panels: (a) SPSA with delta RTF penalty; (b) CMA-ES with delta RTF penalty. Each panel plots WER (%) and RTF over the number of function evaluations (#eval).]

Figure 5.2  WER and RTF development of the experiments in Table 5.8.


[Figure, two panels: (a) SPSA with increasing RTF penalty; (b) SPSA with adaptive RTF penalty. Each panel plots WER (%) and RTF over the number of function evaluations (#eval).]

Figure 5.3  WER and RTF development of the experiments in Table 5.9.


[Figure, two panels: (a) increasing RTF penalty; (b) adaptive RTF penalty. Each panel plots WER (%) and RTF over the iterations; panel (a) marks the point at which k becomes fixed.]

Figure 5.4  WER and RTF development of the experiments in Table 5.10.


6 Conclusion

By conducting numerous experiments focusing on the decoding parameter optimization task in automatic speech recognition, this thesis investigated the performance of five optimization methods, namely Simultaneous Perturbation Stochastic Approximation, Downhill Simplex, Simulated Annealing, Evolution Strategies, and Gradient Descent. The goal of the optimization is to improve the performance of an automatic speech recognition system expressed in terms of WER and RTF: the unconstrained optimization disregards the RTF, while the time-constrained optimization considers both. This thesis also suggests several approaches for designing time-constrained optimization experiments based on the works of El Hannani and Hain [8] and Stein et al. [35]. These include the delta objective function, the increasing objective function, and the adaptive objective function, in addition to the combination of the SPSA and Gradient Descent methods.

The investigation in this thesis focused on the quality of the optimized solutions and the speed of the optimization algorithms in terms of decoder calls (i.e., objective function calls).

    In summary, the contributions of this thesis are:

- A description of the employed gradient-free optimization algorithms.

- The introduction of several objective functions for the parameter optimization task.

- The successful employment of the optimization algorithms to enhance the performance of two speech recognizers, namely Julius and Kaldi, in terms of speed and accuracy.


The following conclusions of this thesis are based on the results of the experiments:

Generally, all tested algorithms successfully enhanced the automatic speech recognition performance (i.e., decreased the Word Error Rate and/or sped up the decoder) in comparison with the performance of the manually adjusted parameters (the baseline parameter set). For unconstrained optimization, the Simulated Annealing method is capable of offering a quality solution if the optimization cost is not a concern. Given limited time, on the other hand, the SPSA method has a strong advantage: it is the fastest method in improving the WER value, and it is also the fastest method in reducing the RTF value when combined with the increasing RTF penalty objective function. The CMA-ES technique also offered remarkable performance, providing relatively good solutions at an acceptable optimization cost for both the unconstrained and the time-constrained settings. Furthermore, one should note that the Downhill Simplex method is not suitable for the time-constrained test, arguably due to the noisy measurement.


    Bibliography

[1] Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., and Mohri, M. OpenFst: A general and efficient weighted finite-state transducer library. In Implementation and Application of Automata, J. Holub and J. Zadrek, Eds., vol. 4783 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, pp. 11–23.

[2] Alleva, F., Huang, X., and Hwang, M.-Y. An improved search algorithm using incremental knowledge for continuous speech recognition. In Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on (1993), vol. 2, pp. 307–310.

[3] Baker, J. K. Readings in speech recognition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990, ch. Stochastic Modeling for Automatic Speech Understanding, pp. 297–307.

[4] Baum, D., Schneider, D., Bardeli, R., Schwenninger, J., Samlowski, B., Winkler, T., and Kohler, J. DiSCo - A German Evaluation Corpus for Challenging Problems in the Broadcast Domain. In Proc. Seventh Conference on International Language Resources and Evaluation (LREC) (Valletta, Malta, May 2010).

[5] Chan, A., Ravishankar, M., Rudnicky, A. I., and Sherwani, J. Four-layer categorization scheme of fast GMM computation techniques in large vocabulary continuous speech recognition systems. In INTERSPEECH, ISCA.

[6] Davis, S., and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on 28, 4 (Aug 1980), 357–366.

[7] Dianati, M., Song, I., and Treiber, M. An introduction to genetic algorithms and evolution strategies. University of Waterloo, Canada (2002).


[8] El Hannani, A., and Hain, T. Automatic optimization of speech decoder parameters. Signal Processing Letters, IEEE 17, 1 (2010), 95–98.

[9] Hansen, N. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation, J. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, Eds., vol. 192 of Studies in Fuzziness and Soft Computing. Springer Berlin Heidelberg, 2006, pp. 75–102.

[10] Hermansky, H., Hanson, B., and Wakita, H. Perceptually based linear predictive analysis of speech. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP 85 (Apr 1985), vol. 10, pp. 509–512.

[11] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.

[12] Johnston, P. R., and Pilgrim, D. H. Parameter optimization for watershed models. Water Resources Research 12, 3 (1976), 477–486.

[13] Kacur, J., and Korosi, J. An accuracy optimization of a dialog ASR system utilizing evolutional strategies. In Image and Signal Processing and Analysis (2007), pp. 180–184.

[14] Kiefer, J., and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics 23, 3 (09 1952), 462–466.

[15] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680.

[16] Lee, A., Kawahara, T., and Shikano, K. Julius - an Open Source Real-Time Large Vocabulary Recognition Engine. In Proceedings of Eurospeech (Aalborg, Denmark, 2001), pp. 1691–1694.

[17] Levenshtein, V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10 (1966), 707.

[18] Macherey, W. Discriminative Training and Acoustic Modeling for Automatic Speech Recognition. PhD thesis, RWTH Aachen University, Aachen, Germany, Mar. 2010.

[19] Mak, B., and Ko, T. Automatic estimation of decoding parameters using large-margin iterative linear programming. In INTERSPEECH (2009), pp. 1219–1222.


[20] Maryak, J. L., and Chin, D. C. Global random optimization by simultaneous perturbation stochastic approximation. In Proc. Amer. Control Conf. (2001), pp. 756–762.

[21] Matear, R. J. Parameter optimization and analysis of ecosystem models using simulated annealing: A case study at Station P. Journal of Marine Research 53, 4 (1995).

[22] Moles, C. G., Mendes, P., and Banga, J. R. Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Research 13, 11 (2003), 2467–2474.

[23] Nelder, J., and Mead, R. The Downhill Simplex Method. Computer Journal 7 (1965), 308.

[24] Ney, H., and Aubert, X. Dynamic programming search strategies: From digit strings to large vocabulary word graphs. In Automatic Speech and Speaker Recognition, C.-H. Lee, F. Soong, and K. Paliwal, Eds., vol. 355 of The Kluwer International Series in Engineering and Computer Science. Springer US, 1996, pp. 385–411.

[25] Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N. K., Karafiat, M., Rastrow, A., Rose, R. C., Schwarz, P., and Thomas, S. Subspace Gaussian mixture models for speech recognition. In Proc. ICASSP (2010), pp. 4330–4333.

[26] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (Dec. 2011), IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.

[27] Rios, L., and Sahinidis, N. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 3 (2013), 1247–1293.

[28] Rogers, S. P. A. Parallel speech recognition. International Journal of Parallel Programming 27 (1999).

[29] Schwenninger, J., Stein, D., and Stadtschnitzer, M. Automatic parameter tuning and extended training material: Recent advances in the Fraunhofer speech recognition system. In Proc. Workshop Audiosignal- und Sprachverarbeitung (2013).


[30] Spall, J., Hill, S., and Stark, D. Theoretical framework for comparing several popular stochastic optimization approaches. In American Control Conference, 2002. Proceedings of the 2002 (2002), vol. 4, pp. 3153–3158.

[31] Spall, J. C. An overview of the simultaneous perturbation method for efficient optimization.

[32] Spall, J. C. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37:3 (Mar. 1992).

[33] Spall, J. C. Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Transactions on Aerospace and Electronic Systems 34:3 (July 1998).

[34] Spendley, W., Hext, G. R., and Himsworth, F. R. Sequential application of simplex designs in optimisation and evolutionary operation. Technometrics 4, 4 (Nov. 1962), 441–461.

[35] Stein, D., Schwenninger, J., and Stadtschnitzer, M. Simultaneous perturbation stochastic approximation for automatic speech recognition.

[36] Surjanovic, S., and Bingham, D. Virtual library of simulation experiments: Test functions and datasets. Retrieved February 10, 2014, from http://www.sfu.ca/~ssurjano.

A Appendix

Detailed information on the benchmark functions employed in the experiments of Section 5.1. The figures were taken from [36].

    Ackley Function

$L(x_1, \ldots, x_p) = -a \exp\left(-b \sqrt{\frac{1}{p} \sum_{i=1}^{p} x_i^2}\right) - \exp\left(\frac{1}{p} \sum_{i=1}^{p} \cos(c\, x_i)\right) + a + \exp(1)$
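A minimal implementation of the Ackley function is sketched below, assuming the standard parameterization a = 20, b = 0.2, c = 2π recommended in [36]; with this choice the start value L(1, ..., 1) ≈ 3.63 matches the initial objective value reported in Table 5.2.

    import math

    def ackley(theta, a=20.0, b=0.2, c=2.0 * math.pi):
        """Ackley benchmark function (multi-modal, global minimum 0 at the origin)."""
        p = len(theta)
        sum_sq = sum(x * x for x in theta)
        sum_cos = sum(math.cos(c * x) for x in theta)
        return (-a * math.exp(-b * math.sqrt(sum_sq / p))
                - math.exp(sum_cos / p)
                + a + math.e)

    print(round(ackley([1.0, 1.0]), 2))   # 3.63, the initial value used in the benchmark tests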