



Discriminatively Structured Graphical Models for Speech Recognition

The Graphical Models Team
JHU 2001 Summer Workshop

Jeff A. Bilmes — University of Washington, Seattle
Geoff Zweig — IBM
Thomas Richardson — University of Washington, Seattle
Karim Filali — University of Washington, Seattle
Karen Livescu — MIT
Peng Xu — Johns Hopkins University
Kirk Jackson — DOD
Yigal Brandman — Phonetact Inc.
Eric Sandness — Speechworks
Eva Holtz — Harvard University
Jerry Torres — Stanford University
Bill Byrne — Johns Hopkins University

Summer, 2001

Abstract

In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data.

This report describes the process and results from the 2001 Johns Hopkins summer workshop on graphical models. Specifically, in this report we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are learned in a discriminative fashion, and are learned separately and in isolation from the numerical parameters. The resulting independence properties of the model might in fact be wrong with respect to the true model, but are made only for the sake of optimizing classification performance. In order to apply the principles of structural discriminability, we adopt the framework of graphical models, which allows an arbitrary set of random variables and their conditional independence relationships to be modeled at each time frame.

We also, in this document, describe and present results using a new graphical modeling toolkit (GMTK). Using GMTK and discriminative structure learning heuristics, the results presented herein indicate that significant gains result from discriminative structural analysis of both conventional MFCC and novel AM-FM features on the Aurora continuous digits task. Lastly, we also present results using GMTK on several other tasks, such as on an IBM audio-video corpus, preliminary results on the SPINE-1 data set using hidden noise variables, on hidden articulatory modeling using GMTK, and on the use of interpolated language models represented by graphs within GMTK.


Contents

1 Introduction

2 Overview of Graphical Models (GMs)
  2.0.1 Semantics
  2.0.2 Structure
  2.0.3 Implementation
  2.0.4 Parameterization
  2.1 Efficient Probabilistic Inference

3 Graphical Models for Automatic Speech Recognition

4 Structural Discriminability: Introduction and Motivation

5 Explicit vs. Implicit GM-structures for Speech Recognition
  5.1 HMMs and Graphical Models
  5.2 A more explicit structure for Decoding
  5.3 A more explicit structure for training
  5.4 Rescoring
  5.5 Graphical Models and Stochastic Finite State Automata

6 GMTK: The graphical models toolkit
  6.1 Toolkit Features
    6.1.1 Explicit vs. Implicit Modeling
    6.1.2 The GMTKL Specification Language
    6.1.3 Inference
    6.1.4 Logarithmic Space Computation
    6.1.5 Generalized EM
    6.1.6 Sampling
    6.1.7 Switching Parents
    6.1.8 Discrete Conditional Probability Distributions
    6.1.9 Graphical Continuous Conditional Distributions

7 The EAR Measure and Discriminative Structure learning Heuristics
  7.1 Basic Set-Up
  7.2 Selecting the optimal number of parents for an acoustic feature X
  7.3 The EAR criterion
  7.4 Class-specific EAR criterion
  7.5 Optimizing the EAR criterion: heuristic search
  7.6 Approximations to the EAR measure
    7.6.1 Scalar approximation 1
    7.6.2 Scalar approximation 2
    7.6.3 Scalar approximation 3
  7.7 Conclusion

8 Visualization of Mutual Information and the EAR measure
  8.1 MI/EAR: Aurora 2.0 MFCCs
  8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features
  8.3 MI/EAR: Aurora 2.0 AM/FM Features
  8.4 MI/EAR: SPINE 1.0 Neural Network Features

9 Visualization of Dependency Selection

10 Corpora description and word error rate (WER) results
  10.1 Experimental Results on Aurora 2.0
    10.1.1 Baseline GMTK vs. HTK result
    10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment
    10.1.3 Aurora 2.0 different features
    10.1.4 Mutual Information Measures
    10.1.5 Induced Structures
    10.1.6 Improved Word Error Rate Results
  10.2 Experimental Results on IBM Audio-Visual (AV) Database
    10.2.1 Experiment Framework
    10.2.2 Baseline
    10.2.3 IBM AV Experiments in WS'01
    10.2.4 Experiment Framework
    10.2.5 Matching Baseline
    10.2.6 GMTK Simulating an HMM
    10.2.7 EAR Measure of Audio Features
  10.3 Experimental Results on SPINE-1: Hidden Noise Variables

11 Articulatory Modeling with GMTK
  11.1 Articulatory Models of Speech
  11.2 The Articulatory Feature Set
  11.3 Representing Speech with Articulatory Features
  11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

12 Other miscellaneous workshop accomplishments
  12.1 GMTK Parallel Training Facilities
    12.1.1 Parallel Training: emtrain parallel
    12.1.2 Parallel Viterbi Alignment: Viterbi align parallel
    12.1.3 Example emtrain parallel header file
    12.1.4 Example Viterbi align parallel header file
  12.2 The Mutual Information Toolkit
    12.2.1 Mutual Information and Entropy
    12.2.2 Toolkit Description
    12.2.3 EM for MI estimation
    12.2.4 Conditional entropy
  12.3 Graphical Model Representations of Language Model Mixtures
    12.3.1 Graphical Models for Language Model Mixtures
    12.3.2 Perplexity Experiments
    12.3.3 Perplexity Results
    12.3.4 Conclusions

13 Future Work and Conclusions
  13.1 Articulatory Models
    13.1.1 Additional Structures
    13.1.2 Computational Issues
  13.2 Structural Discriminability
  13.3 GMTK

14 The WS01 GM-ASR Team


1 Introduction

In this report, we describe the results from the Johns Hopkins workshop that took place during a 6-week period over the summer of 2001. During this time, novel research was performed in several different areas. These areas include: 1) graphical models and their application to speech recognition; 2) the design, development and testing of a new graphical-model based toolkit for this purpose, to allow for the rapid exploration of a wide variety of different models for ASR; 3) the exploration of a new method to discriminatively construct the graphical model structure (the nodes and edges of the graph); 4) the application of graphical models and structure learning to novel speech features, in particular standard MFCCs and novel amplitude and frequency modulation features; 5) the initial evaluation of graphical models on three data sets: the Aurora 2.0 noisy speech corpus, an IBM audio-visual corpus, and SPINE-1, the DARPA speech in noisy environments corpus; 6) the beginnings of the use of graphical models to represent anatomically correct articulatory-based speech recognition; 7) the beginnings of the development of a software toolkit for computing mutual information and related quantities on large data sets, used both to visualize dependencies in these data sets and to help determine graphical model structures; and 8) the application of graphical models and GMTK to the problem of simple smoothed language models.

As a main goal of the workshop, the graphical model methodology developed attempted to optimize the structure of the network so as to improve classification performance (e.g., speech recognition error) rather than simply to better describe or improve the likelihood of the data. This document will outline the theory, and describe the methodology and results, both positive and negative, that took place during the 6 weeks of the workshop.

Broadly, this report is organized as follows: In Section 2, we first provide a general introduction to graphical models, and briefly introduce the notation we will use throughout this document. Section 3 provides a broad overview and introduction on how graphical models are a promising approach to the speech recognition task. Section 4 provides an introduction and motivation for one of the main goals of the workshop, that of structural discriminability. This section provides a number of intuitive examples of why such an approach should yield improved performance, even when discriminative parameter training methods are not used. Section 5 describes in more detail the various ways in which graphical models can be used for speech recognition systems, namely the explicit vs. the implicit approach, and the various trade-offs between the two. Section 6 provides an overview of GMTK, the graphical model toolkit, software that was developed for use at the workshop that allows the rapid use of graphical models for language, speech, and other time-series processes. Section 7 develops a specific method to form structurally discriminative networks, and describes and provides a new derivation for the EAR measure, a quantity that is useful for this purpose. Section 8 provides a number of examples of the visualization both of conditional mutual information and the EAR measure on a number of different corpora and speech feature sets. Section 9 contains the visualization of the result of using the EAR measure to induce discriminative structure on the three corpora that were used in this study. Section 10 describes in more detail the three corpora that were used, baseline and other results, and improved results using structure determination. Section 11 describes articulatory based speech recognition (another of the workshop goals) and how GMTK can be used to represent hidden articulatory models for speech recognition. Section 12 describes a number of other workshop accomplishments, including: 1) the GMTK parallel training/testing scripts developed at the workshop (Section 12.1); 2) the beginnings of the development of the mutual-information toolkit (Section 12.2), which can be used to compute general mutual information quantities between discrete/continuous random variables; and 3) the beginnings of the application of graphical models to representing mixtures of different order language models (Section 12.3). Section 13 concludes and describes future work, and lastly Section 14 describes the WS01 GM team.

2 Overview of Graphical Models (GMs)

Broadly speaking, graphical models (GMs) offer two primary features to those interested in working with statistical systems. On the one hand, a GM may be viewed as an abstract, formal, and visual language that can depict important properties (conditional independence) of natural systems and signals when described by multi-variate random processes. There are mathematically precise rules that describe what a given graph means, rules which associate with a graph a family of probability distributions. Natural signals (those which are not purely random) have significant statistical structure, and this can occur at multiple levels of granularity. Graphs can show anything from causal relations between high-level concepts [84] down to the fine-grained dependencies existing within the neural code [3]. On the other hand, along with GMs come a set of algorithms for efficiently performing probabilistic inference and decision making. Although such computations are intractable in general, GM inference procedures and their approximations exploit the inherent structure in a graph in a way that can significantly reduce computational and memory demands, thereby making probabilistic inference as fast as possible.

Simply put, graphical models in one way or another describe conditional independence properties amongst collections of random variables. A given GM is identical to a list of conditional independence statements, and a graph represents all distributions for which all these independence statements are true. A random variable X is conditionally independent of a different random variable Y given a third random variable Z under a given probability distribution p(·), if the following relation holds:

p(X = x, Y = y|Z = z) = p(X = x|Z = z)p(Y = y|Z = z)

for all x, y, and z. This is written X⊥⊥Y|Z, and it is said that "X is independent of Y given Z under p(·)". This has the following intuitive interpretation: if one has knowledge of Z, then knowledge of Y does not change one's knowledge of X, and vice versa. Conditional independence is different from unconditional (or marginal) independence: X⊥⊥Y does not imply X⊥⊥Y|Z, nor vice versa. Conditional independence is a powerful concept: using conditional independence, a statistical model can undergo enormous simplifications. Moreover, even though conditional independence might not hold for certain signals, making such assumptions might yield vast improvements because of computational, data-sparsity, or task-specific reasons (e.g., consider the hidden Markov model with assumptions which obviously do not hold for speech [6], but which nonetheless empirically appear to be somewhat benign, and at times even helpful, as described in Section 4 and [7]). Formal properties of conditional independence, and many other equivalent mathematical formulations, are described in [69, 84].
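As a small numerical illustration, the definition can be checked directly on a toy joint distribution. The following is a minimal sketch (Python with NumPy, made-up table sizes, not part of the workshop software) that builds a joint p(x, y, z) = p(x|z) p(y|z) p(z), which satisfies X⊥⊥Y|Z by construction, and then verifies the defining identity:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 3
    p_z = rng.random(K); p_z /= p_z.sum()
    p_x_z = rng.random((K, K)); p_x_z /= p_x_z.sum(axis=0, keepdims=True)   # p(x|z)
    p_y_z = rng.random((K, K)); p_y_z /= p_y_z.sum(axis=0, keepdims=True)   # p(y|z)
    p_xyz = np.einsum('xz,yz,z->xyz', p_x_z, p_y_z, p_z)                    # p(x, y, z)

    # Check p(x, y | z) == p(x | z) p(y | z) for all x, y, z.
    pz = p_xyz.sum(axis=(0, 1))                 # p(z)
    p_xy_given_z = p_xyz / pz                   # p(x, y | z)
    p_x_given_z = p_xyz.sum(axis=1) / pz        # p(x | z)
    p_y_given_z = p_xyz.sum(axis=0) / pz        # p(y | z)
    assert np.allclose(p_xy_given_z, p_x_given_z[:, None, :] * p_y_given_z[None, :, :])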

A GM [69, 25, 105, 84, 60] is a graph G = (V, E), where V is a set of vertices (also called nodes or random variables) and the set of edges E is a subset of the set V × V. The graph describes an entire family of probability distributions over the variables V. A variable can be either scalar- or vector-valued, where in the latter case the vector variable implicitly corresponds to a sub-graphical model over the elements of the vector. The edges E, depending on the graph semantics (see below), specify a set of conditional independence properties over the random variables. The properties specified by the GM are true for all members of its associated family.

Four items must be specified when using a graph to describe a particular probability distribution [11]: the GM semantics, structure, implementation, and parameterization. The semantics and the structure of a GM are inherent to the graph itself, while the implementation and parameterization are implicit within the underlying model. Each of these is now described in turn.

2.0.1 Semantics

There are many types of GMs, each one with differing semantics. The set of conditional independence assumptions specified by a particular GM, and therefore the family of probability distributions it represents, will be different depending on the type of GM currently being considered. The semantics specifies a set of rules about what is or is not a valid graph and what set of distributions correspond to a given graph. Various types of GMs include directed models (or Bayesian networks) [84, 60],1 undirected networks (or Markov random fields) [19], factor graphs [40, 68], chain graphs [69, 90] which are combinations of directed and undirected GMs, causal models [85], dependency networks [52], and many others. When the semantics of a graph change, the family of distributions it represents also changes, but overlap can exist between certain families (i.e., there might be a probability distribution that has a representation by two different types of graphical model). This also means that the same exact graph (i.e., the actual graphical picture) might represent very different families of probabilities depending on the current semantics. Therefore, when using a GM, it is critical to first agree upon the semantics that is currently being used.

A Bayesian network (BN) [84, 60, 51] is one type of GM where the graph edges are directed and acyclic. In a BN, edges point from parent to child nodes, and the graph implicitly spells out a factorization that is a simplification of the chain rule of probability, namely:

p(X1:N) = ∏_i p(Xi | X1:i−1) = ∏_i p(Xi | Xπi)

The first equality is the probabilistic chain rule, and the second equality holds under a particular BN, where πi designates node i's parents according to the BN. A probability distribution that is represented by a given BN will factorize with respect to that BN, and this is called the directed factorization property [69].
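The factorization can be read as a recipe for evaluating the joint probability of a full assignment: for each node, multiply in the CPT entry selected by the node's own value and its parents' values. The following is a minimal sketch (the dictionary-based representation and the function name are illustrative only, not GMTK's API):

    def bn_joint_probability(assignment, parents, cpts):
        """assignment: dict node -> value; parents: dict node -> tuple of parent nodes;
        cpts: dict node -> array whose first axis is the node and remaining axes its parents."""
        prob = 1.0
        for node, pa in parents.items():
            index = (assignment[node],) + tuple(assignment[p] for p in pa)
            prob *= cpts[node][index]   # contributes p(x_i | x_{pi_i})
        return prob

For instance, a three-node chain A → B → C would be encoded as parents = {'A': (), 'B': ('A',), 'C': ('B',)}.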

1Note that the name "Bayesian network" does not imply Bayesian statistical inference. In fact, both Bayesian and non-Bayesian Bayesian networks may exist.


A Dynamic Bayesian Network (DBN) [29, 46, 43, 109] has exactly the same semantics as a BN, but is structured to have a sequence of clusters of connected vertices, where edges between clusters point only in the direction of increasing time. DBNs are particularly useful to describe time signals such as speech. GMTK, in fact, is a general tool that allows users to experiment with DBNs.

Several equivalent schemata exist that formally define a BN's conditional independence relationships [69, 84, 60]. The idea of d-separation (or directed separation) is perhaps the most widely known: a set of variables A is conditionally independent of a set B given a set C if A is d-separated from B by C. D-separation holds if and only if all paths that connect any node in A and any other node in B are blocked. A path is blocked if it has a node v along the path with either: 1) the arrows along the path do not converge at v (i.e., serial or diverging at v) and v ∈ C, or 2) the arrows along the path do converge at v, and neither v nor any descendant of v is in C. From d-separation, one may "read off" a list of conditional independence statements from a graph. The set of probability distributions for which this list of statements is true is precisely the set of distributions represented by the graph. Graph properties equivalent to d-separation include the directed local Markov property [69] (a variable is conditionally independent of its non-descendants given its parents), factorization according to the graph, and the Bayes-ball procedure [96] (shown in Figure 1).

Figure 1: The Bayes-ball procedure makes it easy to answer questions about a given BN such as "is XA⊥⊥XB|XC?", where XA, XB, and XC are disjoint sets of nodes in a graph. The answer is yes if and only if no imaginary ball, bouncing from node to node along the edges in the graph and starting at any node in XA, can reach any node in XB. The ball must bounce according to the rules as depicted in the figure. Only the nodes in XC are shaded. A ball may bounce through a node to another node depending both on its shading and the direction of its edges. The dashed arrows depict whether a ball, when attempting to bounce through a given node, may bounce through that node or if it is blocked and must bounce back to the beginning.
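The Bayes-ball rules can be implemented directly as a reachability search over (node, direction) states. The sketch below (hypothetical function and argument names, pure Python) returns True exactly when XA is d-separated from XB given XC:

    from collections import deque

    def d_separated(parents, A, B, C):
        """parents: dict node -> list of parents (a DAG); A, B, C: disjoint node sets."""
        children = {v: set() for v in parents}
        for v, ps in parents.items():
            for p in ps:
                children[p].add(v)
        # Nodes that are in C or have a descendant in C (needed for the converging-arrow rule).
        anc = set()
        frontier = set(C)
        while frontier:
            v = frontier.pop()
            if v not in anc:
                anc.add(v)
                frontier |= set(parents[v])
        C = set(C)
        # Search over (node, direction): 'up' = reached from a child, 'down' = from a parent.
        queue = deque((a, 'up') for a in A)
        visited = set()
        while queue:
            v, d = queue.popleft()
            if (v, d) in visited:
                continue
            visited.add((v, d))
            if v in B:
                return False                     # an active (unblocked) path reaches B
            if d == 'up' and v not in C:         # serial or diverging node, not observed
                queue.extend((p, 'up') for p in parents[v])
                queue.extend((c, 'down') for c in children[v])
            elif d == 'down':
                if v not in C:                   # serial node, not observed
                    queue.extend((c, 'down') for c in children[v])
                if v in anc:                     # converging node with v or a descendant in C
                    queue.extend((p, 'up') for p in parents[v])
        return True

For the left chain of Figure 2, d_separated({'A': [], 'B': ['A'], 'C': ['B']}, {'A'}, {'C'}, {'B'}) returns True, while the same query with an empty conditioning set returns False.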

Conditional independence properties in undirected graphical models (UGMs) are much simpler than for BNs, and are specified using graph separation. For example, assuming that XA, XB, and XC are disjoint sets of nodes in a UGM, XA⊥⊥XB|XC is true when all paths from any node in XA to any node in XB intersect some node in XC. In a UGM, a distribution may be described as a factorization of potential functions over the cliques in the graph.

BNs and UGMs are not the same, meaning that they correspond to different families of probability distributions. Despite the fact that BNs have complicated semantics, they are useful for a variety of reasons. One is that BNs can have a causal interpretation, where if node A is a parent of B, A can be thought of as a cause of B. A second reason is that the family of distributions associated with BNs is not the same as the family associated with UGMs — there are some useful probability models, for example, that are concisely representable with BNs but which are not representable at all with UGMs (and vice versa). UGMs and BNs do have an overlap, however, and the family of distributions corresponding to this intersection is known as the decomposable models [69]. These models have important properties relating to efficient probabilistic inference and graph type (namely, triangulated graphs and the existence of a junction tree).

In general, a lack of an edge between two nodes does not imply that the nodes are independent. The nodes might be able to influence each other via an indirect path. Moreover, the existence of an edge between two nodes does not imply that the two nodes are necessarily dependent — the two nodes could still be independent for certain parameter values or under certain conditions (e.g., zeros in the parameters, see later sections). A GM guarantees only that the lack of an edge implies some conditional independence property, determined according to the graph's semantics. It is therefore best, when discussing a given GM, to refer only to its (conditional) independence rather than its dependence properties. If one must refer to a directed dependence between A and B, it is perhaps better to say simply that there is an edge (directed or otherwise) between A and B.

Originally BNs were designed to represent causation, but more recently, models with semantics [85] that more precisely represent causality have been defined. Other directed graphical models have been designed as well [52], and these can be thought of as forming the general family of directed graphical models (DGMs).


Figure 2: This figure shows three BNs with different arrow directions over the same random variables, A, B, and C. On the left, the variables form a three-variable first-order Markov chain A → B → C. In the middle graph, the same conditional independence statement is realized even though one of the arrow directions has been reversed. Both these networks state that A⊥⊥C|B. These two networks do not, however, insist that A and B are not independent. The right network corresponds to the property A⊥⊥C, but it does not imply that A⊥⊥C|B.

2.0.2 Structure

A graph's structure, the set of nodes and edges, determines the set of conditional independence properties for the graph under a given semantics. Note that more than one GM might correspond to exactly the same conditional independence properties even though their structure is entirely different. In this case, multiple very different looking graphs correspond to the same family of probability distributions. In such cases, the various GMs are said to be Markov equivalent [102, 103, 53]. In general, it is not immediately obvious with large complicated graphs how to quickly visually determine if Markov equivalence holds, but algorithms are available which can determine the members of an equivalence class [102, 103, 78, 21].

Nodes in a graphical model can be either observed or hidden. If a variable is observed, it means that its value is known, or that data (or "evidence") is available for that variable. If a variable is hidden, it currently does not have a known value, and all that is available is the conditional distribution of the hidden variables given the observed variables (if any). Hidden nodes are also called confounding, latent, or unobserved variables. Hidden Markov models are so named because they possess a Markov chain that, in many applications, contains only hidden variables.

A node may switch roles, and may sometimes be hidden and at other times be observed. With an HMM, for example, the "hidden" chain might be observed during training (because a phonetic or state-level alignment has been provided) and hidden during recognition (because the hidden variable values are not known for test speech data). When making the query "is A⊥⊥B|C?", it is implicitly assumed that C is observed. A and B are the nodes being queried, and any other nodes in the network not listed in the query are considered hidden. Also, when a collection of sampled data exists (say as a training set), some of the data samples might have missing values, each of which would correspond to a hidden variable. The EM algorithm [30], for example, can be used to train the parameters of hidden variables.

Hidden variables and their edges reflect a belief about the underlying generative process lying behind the phenomenon that is being statistically represented. This is because the data for these hidden variables is either unavailable, is too costly or impossible to obtain, or might even not exist since the hidden variables might only be hypothetical (e.g., specified by hand based on human-acquired knowledge or hypotheses about the underlying domain). Hidden variables can be used to indicate the underlying causes behind an information source. In speech, for example, hidden variables can be used to represent the phonetic or articulatory gestures, or more ambitiously, the originating semantic thought behind a speech waveform.

Certain GMs allow for what are called switching dependencies [45, 79, 11]. In this case, edges in a GM can change as a function of other variables in the network. An important advantage of switching dependencies is the reduction in the number of parameters needed by the model. Switching dependencies are also used in a new graphical model-based toolkit for ASR [9] (see Section 6). A related construct allows GMs to have optimized local probability implementations [42].

It is sometimes the case that certain observed variables are only used as conditional variables. For example, consider the graph B → A, which implies a factorization of the joint distribution P(A,B) = P(A|B)P(B). In many cases, it is not necessary to represent the marginal distribution over B. In such cases B is a "conditional-only" variable, meaning it is always and only to the right of the conditioning bar. In this case, the graph represents P(A|B). This can be useful in a number of cases including classification (or discriminative modeling), where we might only be interested in posterior distributions over the class random variable, or in situations where additional observations, say Z, exist which might be marginally independent of a class variable, say C, but which, conditioned on other observations, say X, are dependent. This can be depicted by the graph C → X ← Z, where it is assumed that the distribution over Z is not represented.

Often, the true (or the best) structure for a given task is unknown. This can mean that either some of the edges or nodes (which can be hidden) or both can be unknown. This has motivated research on learning the structure of the model from the data, with the general goal to produce a structure that accurately reflects the important statistical properties that exist in the data set. Such approaches can take a Bayesian [51, 53] or frequentist point of view [17, 67, 51]. Structure learning is akin to both statistical model selection [71, 18] and data mining [28]. Several good reviews of structure learning are presented in [17, 67, 51]. Structure learning from a discriminative perspective, thereby producing what could be called discriminative generative models, was proposed in [6].

In this report, in fact, a method that is entitled structural discriminability is given initial evaluation. In contrast to typical structure learning in graphical models, structural discriminability is an attempt to, within the space of graph structures, find one that performs best at the classification task. This implies that certain dependency statements might be made by the model which are in general not true in the data, and are made only for the sake of classification accuracy. More on this is described in Sections 4 and 7.

Figure 3 depicts a topological hierarchy of both the semantics and structure of GMs, and shows where differentmodels fit into place.

(Figure content: the taxonomy's node labels include graphical models; chain graphs; causal models; other semantics; DGMs; UGMs; Bayesian networks; dependency networks; FSTs; MRFs; Gibbs/Boltzmann distributions; DBNs; mixture models; decision trees; simple models; PCA; LDA; HMMs; factorial HMMs/mixed-memory Markov models; BMMs; Kalman filters; and segment models.)

Figure 3: A topology of graphical model semantics and structure

2.0.3 Implementation

When two nodes are connected by a dependency edge, the local conditional probability representation of that dependency may be called its implementation. A dependence of a variable X on Y can occur in a number of ways depending on whether the variables are discrete or continuous. For example, one might use discrete conditional probability tables (CPTs), compressed tables [42], decision trees, or even a deterministic function (in which case GMs may represent data-flow [1] graphs, or may represent channel coding algorithms [40]). GMTK, described in Section 6, makes heavy use of deterministic dependencies. A node in a GM can also depict a constant input parameter since random variables can themselves be constants. Alternatively, the dependence might be linear regression models, mixtures thereof, or non-linear regression (such as a multi-layered perceptron [14], or a STAR [100] or MARS [41] model). In general, different edges in a graph will have different implementations.

In UGMs, conditional distributions are not represented explicitly. Rather, a joint distribution over all the nodes in the graph is specified with a product of what are called "potential" functions over cliques in the graph. In general, the clique potentials could be anything, although particular types are commonly used (such as Gibbs or Boltzmann distributions [54]). Many such models fall under what are known as exponential models [34]. The implementation of a dependency in a UGM is implicitly specified via these functions in that they specify the way in which one variable can influence the resulting probabilities for other random variable values.

2.0.4 Parameterization

The parameterization of a model corresponds to the parameter values of a particular implementation in a particular structure. For example, with linear regression, the parameters are simply the regression coefficients; for a discrete probability table, the parameters are the table entries. Since the parameters of a distribution can themselves be treated as random variables, and thus as nodes, Bayesian approaches may easily be represented [51] with GMs.

Many algorithms exist for training the parameters of a graphical model. These include maximum likelihood [34] such as the EM algorithm [30], discriminative or risk minimization approaches [101], gradient descent [14], sampling approaches [73], or general non-linear optimization [38]. The choice of algorithm depends both on the structure and implementation of the GM. For example, if there are no hidden variables, an EM approach is not required. Certain structural properties of the GM might render certain training procedures less crucial to the performance of the model [11].

2.1 Efficient Probabilistic Inference

A key application of any statistical model is to compute the probability of one subset of random variables given values for some other subset, a procedure known as probabilistic inference. Inference is essential both to make predictions based on the model and to learn the model parameters using, for example, the EM algorithm [30, 77]. One of the critical advantages of GMs is that they offer procedures for making exact inference as efficient as possible, much more so than if conditional independence is ignored or is used unwisely. And if the resulting savings is not enough, there are GM-inspired approximate inference algorithms that are more efficient still.

(Figure content: a six-node directed graph over a, b, c, d, e, f whose edges match the factorization used below: b → a, c → b, d → c, e → c, e → d, f → d, f → e.)

Figure 4: The graph’s independence properties are used to move sums inside of factors.

Exact inference can in general be quite computationally costly. For example, suppose there is a joint distribution over 6 variables p(a, b, c, d, e, f) and the goal is to compute p(a|f). This requires both p(a, f) and p(f), so the variables b, c, d, e must be "marginalized", or integrated away, to form p(a, f). The naive way of performing this computation would entail the following sum:

p(a, f) = ∑_{b,c,d,e} p(a, b, c, d, e, f)

Supposing that each variable has K possible values, this computation requires O(K^6) operations, a quantity which is exponential in the number of variables in the joint distribution. If, on the other hand, it was possible to factor the joint distribution into factors containing fewer variables, it would be possible to reduce computation significantly. For example, under the graph in Figure 4, the above distribution may be factored as follows:

p(a, b, c, d, e, f) = p(a|b) p(b|c) p(c|d, e) p(d|e, f) p(e|f) p(f)

so that the sum

p(a, f) = p(f) ∑_b p(a|b) ∑_c p(b|c) ∑_e p(e|f) ∑_d p(c|d, e) p(d|e, f)

requires only O(K^3) computation for any particular value of f. Inference in GMs involves formally defined manipulations of graph data structures and then operations on those data structures. These operations provably correspond to valid operations on probability equations, and they reduce computation essentially by moving sums, as in the above, as far to the right as possible in these equations.
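To make the savings concrete, the following sketch (Python/NumPy, random CPTs with K = 4, illustrative only) computes p(a, f) both ways: the naive route materializes the full K^6 joint, while the factored route pushes the sums inward and never handles more than a few variables at once; the two answers agree.

    import numpy as np

    K = 4
    rng = np.random.default_rng(0)

    def random_cpt(shape):
        """Random conditional table, normalized over its first axis (the child variable)."""
        t = rng.random(shape)
        return t / t.sum(axis=0, keepdims=True)

    # CPTs for the factorization p(a|b) p(b|c) p(c|d,e) p(d|e,f) p(e|f) p(f)
    p_a_b  = random_cpt((K, K))        # p(a|b)
    p_b_c  = random_cpt((K, K))        # p(b|c)
    p_c_de = random_cpt((K, K, K))     # p(c|d,e)
    p_d_ef = random_cpt((K, K, K))     # p(d|e,f)
    p_e_f  = random_cpt((K, K))        # p(e|f)
    p_f    = random_cpt((K,))          # p(f)

    # Naive: build the full O(K^6) joint, then marginalize b, c, d, e.
    joint = np.einsum('ab,bc,cde,def,ef,f->abcdef',
                      p_a_b, p_b_c, p_c_de, p_d_ef, p_e_f, p_f)
    p_af_naive = joint.sum(axis=(1, 2, 3, 4))

    # Factored: push the sums inward, one variable eliminated at a time.
    m_cef = np.einsum('cde,def->cef', p_c_de, p_d_ef)   # sum over d
    m_cf  = np.einsum('cef,ef->cf', m_cef, p_e_f)       # sum over e
    m_bf  = np.einsum('bc,cf->bf', p_b_c, m_cf)         # sum over c
    p_af  = np.einsum('ab,bf,f->af', p_a_b, m_bf, p_f)  # sum over b, times p(f)

    assert np.allclose(p_af, p_af_naive)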

The graph operations and data structures needed for inference are typically described in their own light, without needing to refer back to the original probability equations. One well-known inference procedure, for example, is the junction tree (JT) algorithm [84, 60]. In fact, the commonly used forward-backward algorithm [87] for hidden Markov models is just a special case of the junction tree algorithm [98], which is, in turn, a special case of the generalized distributive law [2].

The JT algorithm requires that the original graph be converted into a junction tree, a tree of cliques with each clique containing nodes from the original graph. A junction tree possesses the running intersection property, where the intersection between any two cliques in the tree is contained in all cliques in the (necessarily) unique path between those two cliques. The junction tree algorithm itself can be viewed as a series of messages passed between the connected cliques of the junction tree. These messages ensure that the neighboring cliques are locally consistent (i.e., that the neighboring cliques have identical marginal distributions on those variables that they have in common). If the messages are passed in an order that obeys a particular protocol, called the message passing protocol, then because of the properties of the junction tree, local consistency guarantees global consistency, meaning that the marginal distributions on all common variables in all cliques in the graph are identical, and also guarantees that inference is correct. Because only local operations are required in the procedure, inference can thus be much faster than if the equations were manipulated naively.

For the junction tree algorithm to be valid, however, a decomposable model must first be formed from the original graph. Junction trees exist only for decomposable models, and a message passing algorithm can provably be shown to yield correct probabilistic inference only in that case. It is often the case, however, that a given DGM or UGM is not decomposable. In such cases it is necessary to form a decomposable model from the general GM (directed or otherwise), and in doing so make fewer conditional independence assumptions. Inference is then solved for this larger family of models. Solving inference for the larger family of course means that inference has also been solved for the smaller family corresponding to the original (possibly) non-decomposable model.

Two operations are needed to transform a general DGM (Bayesian network) into a decomposable model: moralization and triangulation. Moralization joins the unconnected parents of all nodes and then drops all edge directions. This procedure is valid because more edges means fewer conditional independence assumptions, or a larger family of probability distributions. Moralization is required to ensure that the resulting UGM does not violate any of the conditional independence assumptions made by the original DGM. In other words, after moralizing, it is assured that the UGM will make no independence assumption that is not made by the original DGM. If such an invalid independence assumption was made, then the inference algorithm could easily be incorrect.
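Moralization itself amounts to a few lines of graph manipulation; a minimal sketch (dictionary-of-sets adjacency, illustrative naming, not GMTK code) is:

    def moralize(parents):
        """Return an undirected adjacency (dict node -> set of neighbors) for the moral graph:
        every original edge is kept without its direction, and all co-parents of each node are married."""
        und = {v: set() for v in parents}
        for v, ps in parents.items():
            for p in ps:
                und[v].add(p)
                und[p].add(v)            # drop the direction of the edge p -> v
            for p1 in ps:
                for p2 in ps:
                    if p1 != p2:
                        und[p1].add(p2)  # marry co-parents (adding an existing edge is a no-op)
                        und[p2].add(p1)
        return und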

After moralization, or if starting from a UGM to begin with, triangulation is necessary to produce a decomposable model. The set of all triangulated graphs corresponds exactly to the set of decomposable models. The triangulation operation [84, 69] adds edges until every cycle of length four or more has a chord, i.e., an edge connecting two non-consecutive nodes along the cycle. Triangulation is valid because more edges enlarge the set of distributions represented by the graph. Triangulation is necessary because only for triangulated (or decomposable) graphs do junction trees exist. A good survey of triangulation techniques is given in [66].

Finally, a junction tree is formed from the triangulated graph by, first, forming all maximal cliques in the graph, next connecting all of the cliques together into a "super" or "hyper" graph, and finally finding a maximum spanning tree [24] amongst that graph of cliques. In this case, the weight of an edge between two cliques is set to the number of variables in the intersection of the two cliques. Note that there are several ways of forming a junction tree from a graph; the method described above is only one of them.

For a discrete-node-only network, probabilistic inference using the junction tree algorithm has complexity O(∑_{c∈C} ∏_{v∈c} |v|), where C is the set of cliques in the junction tree, c is the set of variables contained within a clique, and |v| is the number of possible values of variable v. The algorithm is exponential in the clique sizes, a quantity that is important to minimize during triangulation. There are many ways to triangulate [66], and unfortunately the operation of finding the optimal triangulation (the one with the smallest cliques) is itself NP-hard. For an HMM, the clique sizes are fixed at two, so the cost per clique is N^2, where N is the number of HMM states; there are T cliques, leading to the well known O(TN^2) complexity for HMMs. Further information on the junction tree and related algorithms can be found in [60, 84, 25, 61].
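The complexity expression translates directly into a small helper that can be used to compare candidate triangulations (illustrative code, not part of GMTK):

    def junction_tree_cost(cliques, cardinality):
        """Sum over cliques of the product of member-variable cardinalities:
        cliques is an iterable of variable collections, cardinality maps variable -> |v|."""
        total = 0
        for clique in cliques:
            size = 1
            for v in clique:
                size *= cardinality[v]
            total += size
        return total

    # Example: T cliques, each containing two N-valued state variables, cost T * N**2,
    # matching the HMM case mentioned above.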

Exact inference, such as the above, is useful only for moderately complex networks since inference is NP-hard in general [23]. Approximate inference procedures can, however, be used when exact inference is not feasible. There are several approximation methods, including variational techniques [94, 58, 62], Monte Carlo sampling methods [73], and loopy belief propagation [104]. Even approximate inference can be NP-hard, however [27]. Therefore, it is always important to use a minimal model, one with the least possible complexity that still accurately represents the important aspects of a task.

The complete study of graphical models takes much time and effort, and this brief overview is far from complete. For further and more complete information, see the references mentioned above.

3 Graphical Models for Automatic Speech Recognition

The underlying statistical model most commonly used for speech recognition is the hidden Markov model (HMM). The HMM, however, is only one example in the vast space of statistical models encompassed by graphical models. In fact, a wide variety of algorithms often used in state-of-the-art ASR systems can easily be described using GMs, and these include algorithms in each of the three categories: acoustic, pronunciation, and language modeling. While many of these ASR approaches were developed without GMs in mind, each turns out to have a surprisingly simple and elucidating network structure. Given an understanding of GMs, it is in many cases easier to understand the technique by looking first at the network than at the original algorithmic description. While it is beyond the scope of this document to describe all of the models that are commonly used for speech recognition and language processing, many of them are described in detail in [7]. Additional graphical models that explicitly account for many of the aspects of a speech recognition system are described in Sections 4 and 5.

4 Structural Discriminability: Introduction and Motivation

Discriminative parameter learning techniques are becoming an important part of automatic speech recognition technology, as indicated by recent advances in large vocabulary tasks such as Switchboard [107], which now complement well known improvements in small vocabulary tasks like digit recognition [82]. These techniques are exemplified by the maximum mutual information learning technique [4] or the minimum classification error (MCE) method [63], which specify procedures for discriminatively optimizing HMM transition and observation probabilities. These methodologies adopt a fixed pre-specified model structure and optimize only the numeric parameters. From a graphical-model point of view, the model structure is fixed while the parameters of the model may vary.

In statistics, the methods of discriminant analysis generalize the discriminative methods used in ASR, as the goal in this case is to design the parameters of a model that is able to distinguish as well as possible between a set of objects (visual, auditory, etc.) as represented by a set of numerical feature values [76].

Moreover, methods of statistical model selection are also commonly used [71, 18] in an attempt to discover the structure and/or the parameters of a model that allow it to best describe a given set of training data. Typically, such approaches include an inverse cost function (such as the model's likelihood) which is maximized. When the cost function is minimized (likelihood is maximized), it is assumed that the model is at the point where it best represents the given training data (and indirectly the true distribution). This cost function is typically further offset by a complexity penalty term so as to ensure that the model that is ultimately selected is not one that merely describes the training data excessively well without having an ability to generalize. Such approaches are seen in the guise of regularization theory [86], minimum description length [92], the Bayesian information criterion [95], and/or structural risk minimization [101].

The technique of structural discriminability [10, 6, 11] stands in significant contrast to the methods above. In this case, the goal is to learn discriminatively the actual edge structure between random variables in graphical models that represent class-conditional probabilistic models. The structure is selected to optimize not likelihood, but rather a cost function that indicates how well the class conditional models do in a classification task. Also, a goal is, within the set of models that do equally well, to choose the one that is as simple as possible. Therefore, the edges in the structurally discriminative graph will almost certainly encode conditional independence statements that are wrong with respect to the data. In other words, the resulting conditional independence statements made by the model might not be true, and there is nothing that is attempting to ensure that the conditional independence statements are indeed true. Rather, the conditional independence statements will be made only for the sake of optimizing a cost function representing discriminability (such as the number of classification errors that are made, or the KL-divergence between the true and the model posterior probability).


Structural discriminability is orthogonal to and complementary with the methods used for fixed-structure parameter optimization. This means that once a discriminative structure is determined, it can be possible to further optimize the model using discriminative parameter training methods to yield further gains. On the other hand, it might be possible that structural discriminability obviates discriminative parameter training – we will explore this idea further below.

At the basis of all pattern classification problems is a set of K classes C1, . . . , CK, and a representation of each of these classes in terms of a set of T random variables X1, . . . , XT (denoted as X1:T for now). For each class, one is interested in obtaining discriminant functions, functions whose maximization should yield the correct class. For example, if gk(X1:T) is the discriminant function for class k, then one would perform the operation:

k∗ = argmax_k gk(X1:T)

If X1:T indeed represented an object of class k∗, then an error would not occur. If X1:T instead represented a different object, say of class k′, then an error occurs. The ultimate goal therefore is to find functions gk such that the errors are minimized over a training data set [33, 76, 101].

It is possible (but not necessary) to place the above goal and procedure into a purely probabilistic setting, where the discriminant functions are either probabilistic or functions of probabilistic quantities. Given the true posterior probability of the class given the features, P(Ck|X1:T), one can clearly see that the error of choosing class k given a feature set X1:T will be 1 − P(Ck|X1:T). Therefore, to minimize this error, the class should be chosen so as to maximize the posterior probability, leading to the following well known Bayes decision rule [33]:

k∗ = argmax_k P(Ck|X1:T)

This is a form of discrimination using the posterior probability, and if a model of P(Ck|X1:T) is formed, say P̂(Ck|X1:T), then the following decision function

k∗ = argmax_k P̂(Ck|X1:T)

is said to use a discriminative model P̂(Ck|X1:T). This model can take many forms, such as logistic regression, a neural network [14, 76], a support vector machine [101], and so on. In each case, the functional form of the probabilistic discriminative model is chosen, and is then optimized (trained) in some way. Typically, the structure and form of such a function is fixed, and the only way that the structure can change is (potentially) by certain parameter coefficients having a "zero value", thereby rendering the structure controlled by these coefficients essentially non-existent. During training, however, there is typically no guarantee that such coefficients can or will be zero. Nor is there a guarantee that the instance of zero coefficients in the model would correspond to conditional independence statements that could be stated by a graphical model of a particular semantics. It is often useful to have as many non-harmful conditional independence statements made as possible because the resulting model is much simpler.
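As one concrete instance of such a discriminative model, the following is a toy multinomial logistic regression posterior (Python/NumPy; the parameter names W and b are purely illustrative and not tied to any system described in this report):

    import numpy as np

    def softmax_posterior(x, W, b):
        """A toy discriminative model P^(C_k | x): multinomial logistic regression.
        x: (dim,) feature vector; W: (K, dim) weight matrix; b: (K,) bias vector."""
        scores = W @ x + b
        scores -= scores.max()          # subtract the max for numerical stability
        p = np.exp(scores)
        return p / p.sum()

    # Decision rule: k* = argmax_k P^(C_k | x)
    # k_star = int(np.argmax(softmax_posterior(x, W, b)))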

For completeness, we note that the above probabilistic decision function can be generalized further to take into account a loss function, which measures the difference in severity between different kinds of mistakes. For example, if the true class is k_1 and class k_4 is chosen, the severity of such a mistake might be much less than if k_2 were chosen. This is often encoded by a loss function L(k'|k), the loss (penalty) of choosing class k' when the true class is k. This is used to produce a risk function R(k'|X_{1:T}):

R(k'|X_{1:T}) = \sum_k L(k'|k) P(C_k|X_{1:T})

which is the expected loss of choosing class k'. The goal is to choose the class that minimizes the overall risk, as in:

k^* = \argmin_{k'} R(k'|X_{1:T})

This decision rule is provably optimal (it minimizes the expected loss) for any given loss function [33, 34]. Moreover, for the 0/1 loss function (a loss function that is 1 whenever k' \neq k and 0 for the correct class k = k'), it is easy to see that this decision rule degenerates into the posterior maximization decision procedure above. It is this 0/1 loss case that we examine further below.
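As a concrete illustration, the following minimal Python sketch evaluates the risk of each class choice for a single observation, and checks that under a 0/1 loss the rule reduces to maximizing the posterior. The posterior values and loss matrix are invented for illustration and are not from this report.

    import numpy as np

    # Toy posterior P(C_k | x) over K = 3 classes for a single observation x.
    posterior = np.array([0.2, 0.5, 0.3])

    # Loss matrix: loss[k_chosen, k_true] is the penalty of choosing k_chosen
    # when the true class is k_true.
    loss = np.array([[0.0, 1.0, 4.0],
                     [1.0, 0.0, 1.0],
                     [4.0, 1.0, 0.0]])

    # Risk of each choice: R(k') = sum_k L(k'|k) P(C_k|x).
    risk = loss @ posterior
    k_star_risk = int(np.argmin(risk))

    # With a 0/1 loss the same rule reduces to maximizing the posterior.
    zero_one = 1.0 - np.eye(3)
    k_star_map = int(np.argmin(zero_one @ posterior))
    assert k_star_map == int(np.argmax(posterior))

    print(risk, k_star_risk, k_star_map)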


It is often useful in many applications (such as speech recognition) to use Bayes rule to decompose the posterior probability P(C_k|X_{1:T}) within the decision rule, thus:

k^* = \argmax_k P(C_k|X_{1:T}) = \argmax_k P(X_{1:T}|C_k) P(C_k) / P(X_{1:T})

On the right-most side, it can be seen that the maximization over k is not affected by P(X_{1:T}), so an equivalent decision rule is therefore:

k^* = \argmax_k P(X_{1:T}|C_k) P(C_k)    (1)

This decision rule involves two factors: 1) the prior probability of the class, P(C_k), and 2) the likelihood of the data given the class, P(X_{1:T}|C_k). This latter factor, when estimated from data and denoted \hat{P}(X_{1:T}|C_k), is often called a generative model. It is a generative model because it is said to be able to generate likely instances of a given class, as represented by the features. For example, if the generative model \hat{P}(X_{1:T}|C_k) were an accurate approximation of the true likelihood function P(X_{1:T}|C_k), and if a sample were drawn as x_{1:T} \sim \hat{P}(X_{1:T}|C_k), then that sample would (with high probability) be a valid instance of the class C_k, at least as well as can be represented by the feature vector X_{1:T}.

Moreover, if \hat{P}(X_{1:T}|C_k) is an accurate representation of the true likelihood function, and \hat{P}(C_k) is an accurate representation of the class prior, then the decision rule:

k^* = \argmax_k \hat{P}(X_{1:T}|C_k) \hat{P}(C_k)

would lead to accurate decisions. Therefore, a goal that is often pursued is to find likelihood and prior approximations that are as accurate as possible. This leads naturally to standard maximum likelihood training procedures: given a data set consisting of N independent samples, D = \{(x^1_{1:T}, k^1), \ldots, (x^N_{1:T}, k^N)\}, the goal is to find the approximation of the likelihood function that best explains the data, or:

\hat{P}^* = \argmax_{\hat{P}} \prod_{i=1}^{N} \hat{P}(x^i_{1:T} | k^i)

where the optimization is done over some set (possibly infinite) of likelihood function approximations under consideration. Note that because the samples are assumed to be independent of each other, the optimization can be broken into K separate optimization procedures, one for each class k = 1, \ldots, K:

\hat{P}^*(\cdot|k) = \argmax_{\hat{P}(\cdot|k)} \prod_{i : k^i = k} \hat{P}(x^i_{1:T} | k)

where \hat{P}(\cdot|k) indicates that the optimization is done only over those class-conditional likelihood approximation functions that could be considered as a generative model for class k.

It can be proven that as the size of the (training) data set D grows to infinity, and if the true class-conditional likelihood function P(X_{1:T}|k) lies within the set of models \hat{P}(\cdot|k) being optimized over, then the maximum likelihood procedure will converge to the true answer. This is the notion of asymptotic consistency, and it is given formal treatment in many texts such as [26]. It is moreover the case that the maximum likelihood procedure minimizes the KL-divergence between the true likelihood function and the model, as in:

\argmin_{\hat{P}(\cdot|k)} D(P(X_{1:T}|k) \,\|\, \hat{P}(X_{1:T}|k))

The maximum likelihood training procedure is therefore a well-founded technique for obtaining an approximation of the likelihood function.
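The following sketch illustrates the per-class maximum-likelihood procedure and the decision rule of Equation 1 for a deliberately simple generative model (one diagonal Gaussian per class). The data and model family are illustrative stand-ins, not anything used in the experiments reported here.

    import numpy as np

    def fit_class_conditionals(X, y, num_classes):
        """ML diagonal-Gaussian estimate of P(x | C_k) per class, plus priors P(C_k)."""
        params, priors = [], []
        for k in range(num_classes):
            Xk = X[y == k]
            mu = Xk.mean(axis=0)
            var = Xk.var(axis=0) + 1e-6          # ML variance, with a small floor
            params.append((mu, var))
            priors.append(len(Xk) / len(X))
        return params, np.array(priors)

    def log_likelihood(x, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def classify(x, params, priors):
        # Decision rule of Equation 1: argmax_k P(x | C_k) P(C_k), in the log domain.
        scores = [log_likelihood(x, mu, var) + np.log(p)
                  for (mu, var), p in zip(params, priors)]
        return int(np.argmax(scores))

    # Illustrative data: two 2-D classes.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([-1, 0], 1, (200, 2)), rng.normal([1, 0], 1, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)
    params, priors = fit_class_conditionals(X, y, 2)
    print(classify(np.array([0.8, 0.1]), params, priors))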

Returning to Equation 1, we can see that the maximum likelihood procedure, even given asymptotic consistency and the like, is only a sufficient condition for finding an optimal discriminant function, not a necessary one. In fact, we would be happy with any of a number of functions f living in the family F defined as follows:

F = \{ f : \argmax_k f(X_{1:T}, C_k) P(C_k) = \argmax_k P(X_{1:T}|C_k) P(C_k), \;\; \forall X_{1:T} \}


Figure 5: A 2-dimensional spatial example of discriminability. The left figure depicts the true class-conditional distributions, in this case for a four-class problem. The region where each class has the highest probability is indicated by a different color (or shade). Contour plots are also given for the different class-conditional densities. For example, region and class 4 shows what could be a mixture of four non-convex component density functions. The true distributions lead to the decision boundaries that separate each of the regions. On the right, class-conditional densities are shown that might lead to exactly the same decision boundaries as shown on the left. The densities in this case, however, are much simpler than on the left – the right densities are formed without regard to the complexities of the left densities at points other than the decision boundaries. A goal of forming a discriminant function should be not to model any complexity of the likelihood functions that does not facilitate discrimination.

This means that any f ∈ F, when multiplied by the prior probability and then used as a discriminant function, will be just as effective for classification as the true likelihood function. Clearly, P(X_{1:T}|C_k) lives within F, but the crucial point is that there might be many others, some of which are much simpler – simple in this case could mean computationally easy to evaluate, having few parameters, having a particularly easy-to-understand functional form, and so on. A goal, then, should be to find the f ∈ F that is as simple as possible.

Note that some of the f ∈ F will be valid distributions themselves (i.e., are non-negative and integrate to unity), and others will be general functions. In this work, we are interested primarily in those f ∈ F which are indeed valid densities.

A simple 2-D argument further exemplifies the above. The left of Figure 5 shows contour plots for four different class-conditional likelihood functions in a 2-D space. Each of the regions in which one of the likelihood functions is maximum is indicated by a color, as well as by the decision regions in the plot. As can be seen, each class-conditional density consists of a complex multi-modal distribution that (from looking at the figure) possibly results from a mixture of non-convex component functions. These distributions result in the decision boundaries and regions as shown. Any x-y point falling directly on one of the boundaries could be in either one of the two abutting classes.

When the goal is classification, it is not necessary to represent the complexity of the class-conditional distributions at regions of the space other than at the decision boundaries. If different generative class-conditional density functions were discovered having exactly the same boundaries, but a much simpler form away from the boundaries, the classification error would be the same but the resulting class-conditional likelihood functions could be much simpler. This is indicated on the right of Figure 5, where uni-modal distributions have replaced the multi-modal ones on the left, but where the decision boundaries have not changed.
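The following hypothetical 1-D sketch makes this point numerically: two multi-modal "true" class-conditional densities and two much simpler uni-modal surrogates induce identical decisions, because they share the same decision boundary. The particular densities are invented for illustration and are not those of Figure 5.

    import numpy as np

    def normpdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    # "True" class-conditional densities: multi-modal mirror images of each other,
    # so with equal priors the true decision boundary sits at x = 0.
    def p_true_1(x):
        return 0.5 * normpdf(x, -1.0, 0.3) + 0.5 * normpdf(x, -3.0, 0.8)

    def p_true_2(x):
        return 0.5 * normpdf(x, 1.0, 0.3) + 0.5 * normpdf(x, 3.0, 0.8)

    # Much simpler uni-modal surrogates that induce the same boundary at x = 0.
    def p_simple_1(x):
        return normpdf(x, -1.0, 1.0)

    def p_simple_2(x):
        return normpdf(x, 1.0, 1.0)

    xs = np.linspace(-5.0, 5.0, 2001)
    agree = (p_true_2(xs) > p_true_1(xs)) == (p_simple_2(xs) > p_simple_1(xs))
    print(agree.all())   # True: identical classification decisions on the whole grid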

Note that such class-conditional functions, while being "generative" in that they would generate something (i.e., are valid densities), would not necessarily generate accurate samples of the true class. For example, a large X is indicated on the left of Figure 5, indicating values of a feature vector that are likely to have been generated from the 4th class-conditional likelihood function. It is likely that such an X is an accurate representation of an object of type 4. The same relative position is marked on the right of the figure, also with an X. In this case, it is not likely to have been generated by the generative model for class 4, since it is not located at a point of high probability. Moreover, the large Ys indicate a point that is likely to be generated by the models on the right, but not by the true generative models on the left.


Figure 6: Pictorial example of distinctive features. This figure shows two types of key-like objects: on the left are objects consisting of an annulus and a protruding horizontal bar; on the right, objects consist of a diagonal bar and then a horizontal bar. When designing an algorithm that distinguishes between these two types of objects, the horizontal-bar feature would not be beneficial, since it is common to both object types. It is sufficient to represent the objects only by their unique features relative to each other, as shown on the bottom. Models of the unique attributes of objects can be simpler, since they contain only the minimal set of features necessary to discriminate.

These generative models on the right, therefore, do not necessarily generate objects typical of the classes the models are supposed to represent. Instead, the generative models are minimal and represent only what is needed for discrimination. Therefore, we call these densities discriminative generative models [11].

Another visual geometric example further motivates discriminative generative models, and ultimately structural discriminability. Consider Figure 6, which shows, also in 2-D, instances of two different types of key-like objects. The top of the figure shows instances of objects of class A, which are annuli with a protruding horizontal bar to the right. Objects of class B are diagonal bars with protruding horizontal bars to the right. As can be seen, objects of class A and class B have the horizontal bars in common, and their distinctive features are either the annuli or the diagonal bars. Therefore, in discriminating between objects of the two types, one would expect that the horizontal bars would not be very useful. For discrimination, a model of the two object types need not expend resources representing the horizontal bars, and should instead concentrate on the annuli and diagonal bars respectively. This latter case is shown at the bottom of the figure, where only the discriminative aspects of the objects are represented. In this case, less about the objects needs to be "remembered", leading to a simpler criterion for deciding between the two classes.

In general, when the task is pattern classification, it should only be necessary for a model to represent those features of the objects which are crucial for discrimination. Features which are common to both objects could potentially be ignored entirely without any penalty in classification performance.


Figure 7: Structural Discriminability.

We finally come to the idea of structural discriminability. The essential idea is to find generative class-conditional likelihood functions that are optimal for classification performance and simplicity. We optimize these likelihood models over the space of conditional independence properties as encoded by a graphical model. This means that the goal is to find minimal edge sets such that discrimination is preserved. Minimal edge sets are desirable since the fewer edges in a graphical model, the more conditional independence statements are made, which can lead to fewer parameters, cheaper probabilistic inference, greater generality for limited amounts of training data, and modeling power concentrated only on what is important. In other words, the aim of structural discriminability is to identify a minimal set of dependencies (i.e., edges in a graph) in the class-conditional distributions P(X_{1:T}|C_k) such that there is little or no degradation in classification accuracy relative to the decision rule given in Equation 1.
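One of many conceivable ways to search for such a minimal edge set is greedy backward elimination, sketched below. The scoring function is a stand-in for whatever discriminative criterion is adopted (e.g., held-out classification accuracy of the class-conditional models); this sketch is illustrative only and is not the procedure developed later in this report.

    # Greedy backward elimination of edges: keep an edge only if dropping it hurts
    # the (hypothetical) discriminative score 'score_structure'.
    def prune_edges(edges, score_structure, tolerance=0.0):
        kept = set(edges)
        baseline = score_structure(kept)
        for e in sorted(edges):                  # fixed order for reproducibility
            trial = kept - {e}
            if score_structure(trial) >= baseline - tolerance:
                kept = trial                     # the edge was not discriminative
        return kept

    # Toy score: pretend only the edges (V1,V3) and (V2,V3) matter for discrimination.
    useful = {("V1", "V3"), ("V2", "V3")}
    score = lambda es: len(es & useful)
    print(prune_edges({("V1", "V3"), ("V2", "V3"), ("V1", "V4"), ("V2", "V4")}, score))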

A simple motivating example is given in Figure 7. The top of the figure shows the undirected graphical models for two generative 4-dimensional class-conditional likelihood functions: on the left P(V_1, V_2, V_3, V_4|C = 1) for class 1, and on the right P(V_1, V_2, V_3, V_4|C = 2) for class 2. These edges depict the generative models, meaning that these models depict the truth. Note that many of the edges are common to the two models. For example, the edge between V_1 and V_4 appears both in the model for C = 1 and in the model for C = 2. It might be the case that these common edges could be removed from both models, since they are a common trait of both classes. If all common edges are removed from both models, the result is as shown at the bottom of the figure. Here only the unique edges for the two models are kept: the edge between V_1 and V_3 on the left, and the edge between V_3 and V_2 on the right. These edges represent unique properties of the objects, at least as far as the conditional independence statements the graphs encode are concerned.

It is crucial to realize that the example in Figure 7 is only an illustration. It does not imply that all common edges in class-conditional graphical models should be removed: there might be common edges which turn out to be quite helpful for discrimination. Moreover, there might be information useful for discrimination that is irrespective of the edges. Take, for example, the means of two class-conditional Gaussian densities with equal spherical covariance matrices. The only thing distinguishing the two classes is the means, so no edge-structure adjustment will help discrimination.

On the other hand, there are cases where structural discriminability obviates discriminative parameter training. Specifically, there are cases where an inherently discriminative structure can render discriminative parameter training no more beneficial than regular maximum-likelihood training.


Figure 8: An inherently discriminative structure can render discriminative parameter training no more beneficial than maximum-likelihood training. Conversely, a structurally "confusable" network (one with only anti-discriminative edges) can render even discriminative parameter training ineffectual.

Moreover, the wrong "anti-discriminative" model structure can render even discriminative training ineffectual at producing appropriate discriminative models. For a graphical example, consider Figure 8, and let us assume that there is no discrimination available in the individual variables (e.g., for Gaussians, the means of the random variables are all the same, so it is only the covariance structure which can help produce more discriminative models). The top box shows the truth, meaning the graphs that correspond to the true generative models (tri-variate distributions in this case) for class 1 and class 2. The middle box shows the edges that are common to the two true models. The bottom box shows the edges that are distinct between the two true models. If one insists on using the structures given in the middle box, important discriminative information about the two classes might be impossible to represent; therefore, even discriminative parameter training will be incapable of producing good results. On the other hand, the two bottom graphs show the distinct edges of the two models. Using these class-conditional models, even simple maximum-likelihood training would be able to produce models that are capable of discriminating between objects of the two classes. Discriminative training, in this case, therefore might not have any benefit over maximum-likelihood training.

Further expanding on this example, consider Figure 9, which shows six 3-dimensional zero-mean Gaussian densities corresponding to the graphs in Figure 8. Each graph shows 1500 samples from the corresponding Gaussian along with the marginal planar distributions for each of the variable sets V_1V_2, V_2V_3, and V_1V_3 (the margins, rather than being shown at the actual zero locations for the corresponding axes, are shown projected onto the axis planes of the 3-dimensional plots). The covariance matrices for the six Gaussians are, respectively (moving across and then down), as follows:

\begin{pmatrix} 9 & 4 & 2 \\ 4 & 2 & 1 \\ 2 & 1 & 1 \end{pmatrix} \quad
\begin{pmatrix} 11 & 3 & 1 \\ 3 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \quad
\begin{pmatrix} 5 & 2 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad
\begin{pmatrix} 10 & 3 & 0 \\ 3 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 1 \end{pmatrix} \quad
\begin{pmatrix} 2 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}


Figure 9: Structural Discriminability of Gaussian Structures

On the upper left of Figure 9, it can be seen clearly that V_1 depends on both V_2 and V_3 (i.e., V_1 depends on V_3 only indirectly through V_2, since V_1 \perp\!\!\!\perp V_3 | V_2). Note that the conditional independence in the top-left covariance matrix can easily be seen to hold because the determinant of the off-diagonal submatrix is zero, i.e.

\det \begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix} = 0

reflecting a zero in the (1,3) position of the inverse covariance. On the upper right, it can be seen that V_2 \perp\!\!\!\perp V_3, as indicated by the upper-right graph in Figure 8. From these "truth" models, it can be seen that the ability to discriminate between the two classes lies within the V_2V_3 plane. The middle-row models in Figure 8, and the example of their Gaussian correlates given in Figure 9, will not be able to discriminate well between the two classes regardless of the training method, since they contain only an edge (and therefore a possible dependency) between V_1 and V_2. The bottom-row models in Figure 8 (correspondingly Figure 9) will once again be able to make a distinction between the two classes, and mere maximum-likelihood training would yield solutions such as the ones indicated.
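These Gaussian conditional independence claims can be checked directly: for a Gaussian, a zero entry in the inverse covariance (precision) matrix corresponds to conditional independence of the two variables given the rest. The following sketch verifies this for the first covariance matrix above.

    import numpy as np

    # Covariance of the upper-left Gaussian in Figure 9 (true model for class 1).
    sigma = np.array([[9.0, 4.0, 2.0],
                      [4.0, 2.0, 1.0],
                      [2.0, 1.0, 1.0]])

    # V1 is conditionally independent of V3 given V2 exactly when the (1,3) entry
    # of the precision matrix is zero.
    precision = np.linalg.inv(sigma)
    print(np.round(precision, 6))                          # (1,3)/(3,1) entries are 0
    print(np.linalg.det(sigma[np.ix_([0, 1], [1, 2])]))    # the 2x2 off-diagonal submatrix: det ~ 0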

Note that these bottom-row models are not the only possible structurally discriminative conditional independence properties – in the example, the bottom-right model could just as well assume everything is independent without harm. Note also that in this case a linear discriminant analysis [44] projection would enable simple lower-dimensional Gaussians to discriminate well. In the general case, however (e.g., where the dependencies are non-linear and non-Gaussian), it can be seen that placing certain restrictions on a generative model can make it more amenable to use in a discriminative context. Structural discriminability tries to do this in the space of conditional independence statements as encoded by a graphical model.

We must emphasize here that it is not the case that structural discriminability will necessarily make discriminative parameter training ineffective. In natural settings, it is most likely that a combination of structural discriminability and discriminative parameter training will yield the best of both worlds: a simple model structure that is capable of representing the distinct properties of objects relative to competing objects, and a parameter training method that ensures the models make use of that ability.


Incidentally, it is often asked why delta [36] (first temporal derivative) and double-delta [70, 106] (second temporal derivative) speech feature vectors produce a significant gain in HMM-based speech recognition systems. It turns out that the concept of structural discriminability can be used to shed some light on this situation [7]. The delta feature generation process can indeed be precisely modeled by a graphical model, so why might such precise modeling of delta features not be desirable? The reason is that the edges added to such a model (the correct generative model) would render the delta features independent of the hidden variables, which would make the delta features non-informative. Therefore, by making (wrong) independence statements about the generative process of the delta features, discrimination is improved (see [7] for details).

In summary, the structure of the model family can have significant effects on the parameter training method used. If the wrong model family is used, even discriminative parameter training might not be helpful.

Our goal in this work is to identify a criterion function that best tells us whether a given edge is discriminative or not, and whether it should be removed from or added to a class-conditional generative graphical model. Ideally, there would be a measure that could be computed independently for each edge, such that the edges for which the measure is good enough (e.g., above a threshold) would be retained and all others dropped. A measure that attempts to achieve this goal, the EAR measure, is described in detail in Section 7.
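Schematically, such a per-edge criterion would be used as in the following tiny sketch, where the scores stand in for the EAR measure of Section 7 (or any other per-edge discriminative measure) and all numbers are invented.

    # Hypothetical per-edge selection by thresholding a discriminative score.
    candidate_edges = {("V1", "V3"): 0.41, ("V1", "V4"): 0.02, ("V2", "V3"): 0.37}
    threshold = 0.1
    kept = [e for e, score in candidate_edges.items() if score > threshold]
    print(kept)   # only the edges judged discriminative survive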

In this report, we focus on class-conditional probabilistic models that can be expressed as Bayesian networks. We focus further on, and use, a new graphical model toolkit (GMTK) for representing both standard and discriminative structures for speech recognition. The benefits of this framework include the ability to rapidly and easily express a wide variety of models, and to use them in as efficient a way as possible for a given model structure.

5 Explicit vs. Implicit GM-structures for Speech Recognition

5.1 HMMs and Graphical Models

Undoubtedly the most commonly used model for speech recognition is the hidden Markov model or HMM [5, 59, 87], and so we begin by relating the HMM to graphical models. It has long been realized that the HMM is a special case of the more general class of dynamic graphical models [98], and Figure 10 illustrates the graphical representation of an HMM.

Recall that in the classical definition [59, 87], an HMM consists of:

• The specification of a number of states

• An initial state distribution π

• A state transition matrix A, where A_{ij} is the probability of transitioning from state i to state j between successive observations

• An observation function b(i, t) that specifies the probability of seeing the observed acoustics at time t given that the system is in state i.

In this formulation, the joint probability of a state sequence s_1, s_2, \ldots, s_T and observation sequence o_1, o_2, \ldots, o_T is given by

\pi_{s_1} \prod_{i=1}^{T-1} A_{s_i s_{i+1}} \prod_{i=1}^{T} b(s_i, i)    (2)

In the case that the state sequence or alignment is not known, the marginal probability of the observations can still be computed, either by enumerating all possible state sequences and summing the corresponding joint probabilities, or via dynamic programming recursions. Similarly, the single likeliest state sequence can be computed.
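For concreteness, the following sketch computes both quantities for a small HMM: the joint probability of Equation 2 for a fixed state sequence, and the marginal probability of the observations via the standard forward recursion. It assumes a discrete observation alphabet, so b is represented as a state-by-symbol matrix rather than the general b(i, t) of the text; all parameter values are illustrative.

    import numpy as np

    def joint_log_prob(pi, A, b, states, obs):
        """log of Equation 2 for a fixed state sequence and observation sequence."""
        lp = np.log(pi[states[0]]) + np.log(b[states[0], obs[0]])
        for t in range(1, len(obs)):
            lp += np.log(A[states[t - 1], states[t]]) + np.log(b[states[t], obs[t]])
        return lp

    def forward_log_prob(pi, A, b, obs):
        """Marginal log-probability of the observations via the forward recursion."""
        alpha = pi * b[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * b[:, o]
        return np.log(alpha.sum())

    # Illustrative 2-state, 2-symbol HMM.
    pi = np.array([1.0, 0.0])
    A = np.array([[0.7, 0.3], [0.0, 1.0]])
    b = np.array([[0.9, 0.1], [0.2, 0.8]])
    obs = [0, 0, 1]
    print(joint_log_prob(pi, A, b, [0, 0, 1], obs), forward_log_prob(pi, A, b, obs))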

Figure 10 shows the graphical model representation of an HMM. It is a model in which each time frame has two variables: one whose value represents the state at that time, and one that represents the observation. The conditioning arrows indicate that the probability of seeing a particular state at time t is conditioned on the value of the state at time t-1, and the actual numerical value of this probability reflects the transition probability between the associated states in the HMM. The observation variable at each frame is conditioned on the state variable in the same frame, and the value of P(o_t|s_t) reflects the output probabilities of the HMM. Therefore, the directed factorization property of directed graphical models (the joint probability can be factored into a form where each factor is the probability of a random variable given its parents) immediately yields Equation 2.


Figure 10: Graphical model (specifically, a dynamic Bayesian network) representation of a hidden Markov model (HMM). The graph represents the set of random variables (two per time frame) of an HMM, and the edges encode the set of conditional independence statements made by that HMM.

One important thing to note about the graphical model representation is that it is explicit about absolute time: each time frame gets its own separate set of random variables in the model. In Figure 10, there are exactly four time frames represented, and representing a longer time series would require a graph with more random variables. This is in significant contrast to the classic representation of an HMM, which has no inherent mechanism for representing absolute time: only relative time is represented. Absolute time is of course still needed for computation, and it is typically handled in the auxiliary data structures used for specific computations.

Figure 11 makes this more explicit. At the top of this figure is an HMM that represents the word "digit." There are five states (unshaded circles) representing the different sounds in the word, plus a dummy initial and final state. The arcs in the HMM represent possible transitions, and not conditional independence relationships as in the graphical model. Note that this graph shows only the transition matrix of the Markov chain (only one part of the HMM); in particular, edges are given only where there are non-zeros in the transition matrix.

The self-loops in the Markov chain graph depict that it is possible to be in a state at one time (with a given probability) and then stay in that state at the next time frame. In particular, the picture shows the Markov chain for a "left-to-right" HMM, in which it is only possible to stay in the same state or move forward in the state sequence. This kind of representation is not explicit about absolute time. It can represent 100-frame occurrences of the word "digit" as well as 10- or 1000-frame occurrences. Of course, actual computations must be specific about absolute time, and the absolute temporal aspect is introduced into an HMM via the notion of a computational grid or its equivalent. This is shown at the upper right for a seven-frame occurrence of the word "digit." The horizontal axis represents time; the vertical axis represents the HMM state set, and a path from the lower-left corner of the grid to the upper-right corner represents an explicit path through the states of the HMM over time. In this case, the first frame is aligned to the /d/ state, the second and third frames are aligned to the /ih/ state, and so forth. Although the structure (zeros/non-zeros) of the underlying Markov chain of the HMM is defined by the graph at the left, computations on the HMM are defined with reference to the temporally-explicit grid.

The graphical model representation of the same utterance is shown at the bottom of Figure 11. This is a temporally-explicit structure (it shows absolute time) with seven repeated chunks, one for each of the time slices. The assignment to the state variables of this model corresponds to the HMM path represented in the computational grid. Note that different paths through the grid will correspond to different assignments of values to the state variables in the graphical model. Whereas computation in the HMM typically involves summing or maximizing over all paths, computation in the graphical model typically involves summing or maximizing over all possible assignments to the hidden variables. The analogy between a path in an HMM and an assignment of values to the hidden variables in a graphical model is quite important and should be kept in mind at all times.

The graphical model in Figure 11 represents the information in an HMM. In particular, it encodes the conditional independence assumptions made by the HMM when it is seen as a collection of random variables, two per time slice.

Note, however, that the graphical-model view of the HMM given above does not explicitly account for certain aspects of HMMs as they are typically used in a speech recognition system. One example is that, in practice, the utterances associated with an HMM represent a complete example of a word or words. A specific recording of the word "digit" will start with the /d/ sound and end with the /t/ sound; i.e. the whole word will be spoken. This extra piece of information is not explicitly captured in the previous graphical model. To see this, suppose the value /d/ is assigned to every occurrence of the state variable. (This will actually happen in the course of inference, either explicitly or implicitly depending on the algorithm used.) This concrete assignment will have some probability (obtained by multiplying together many instances of the self-loop probability for /d/ along with the associated observation probabilities), and unless specific provisions are made within the set of hidden random variables of the graphical model, this probability might not be zero. In particular, if the hidden variable set does not have state values that specifically and jointly encode both position and phoneme, it would not be possible to ensure that this assignment of /d/ to all variables has zero probability.


Figure 11: Comparison of two views of an HMM. The more classic views: on the top left, an HMM seen only as the non-zero values of the underlying Markov chain, and on the top right as a grid where time is on the horizontal axis and the possible states are on the vertical axis. The bottom plot shows the graphical model view of an HMM, where the HMM's underlying set of random variables (two per time frame) is shown, and the edges specifically encode the set of conditional independence statements made by the model.

A non-zero probability for such an assignment would violate our prior knowledge that a complete occurrence of "digit" must be spoken.

Such zero-probability assignments in the graphical model correspond, in the classic HMM framework, to a path that does not end in the upper-right corner of the computational grid. This issue is often resolved in the program code, which performs inference by treating the upper-right corner in a special way. This essentially corresponds to a particular query in the graphical model representation, namely P(O_{1:7}, Q_{1:6}, Q_7 = q), meaning that only assignments to the variables are considered that have a specific assignment to the last hidden variable Q_7.

Another issue concerns the use of time-invariant conditional probabilities. Consider the "digit" example. Because the first occurrence of the phone /ih/ is followed by /jh/, it must be the case that P(Q_t = /jh/ | Q_{t-1} = /ih/) > 0 (so that the transition is possible) and P(Q_t = /t/ | Q_{t-1} = /ih/) = 0 (so that the transition is not skipped). However, because the second occurrence of /ih/ is followed by /t/, it must be the case that P(Q_t = /t/ | Q_{t-1} = /ih/) > 0 and P(Q_t = /jh/ | Q_{t-1} = /ih/) = 0, which directly contradicts the previous requirements. Therefore, if the cardinality of the hidden random variables in an HMM is only equal to the number of phonemes, then the HMM would not be able to represent the different meanings of the states at different times, even though the observation distributions for these different states should be shared for the same phoneme.

One way of solving this problem is therefore to distinguish between the first and second occurrences of /ih/: to define /ih1/ and /ih2/. In general, one may increase the number of states in an HMM to encode both phoneme identity and position in the word, and also the sentence, sentence category, and/or any other hidden event required. One must be careful, however, because we typically desire that multiple state values "share" the same output probabilities (i.e., the observation distributions of the same phoneme in different words are typically shared, or tied, in a speech recognition system; this means that P(x|Q = q_1) = P(x|Q = q_2) for all x). This issue is compounded when we consider instances of multiple words. For example, if the word "fit" - /f/ /ih/ /t/ - occurs in the database, should its /ih/ be the same as /ih1/ or /ih2/, or something else entirely?


Figure 12: A more explicit graphical model representation of parameter tying in a speech-recognition-based HMM.

The graphical model view of an HMM (Figure 11, bottom), therefore, while able to encode these sorts of constraints, does not explicitly indicate how the constraints come to be or how they should be implemented.

The graphical model formalism, however, is also able to represent the parameter tying issue via an explicit structural representation using the graph itself, rather than relegating this detail to a particular implementation or to an expanded hidden state space. This has been called the explicit graphical representation approach [9], whereas the simple HMM graphical model is called the implicit approach (expanded upon when discussing GMTK in Section 6.1.1). For example, a graphical model that incorporates the word-end and parameter-tying constraints appropriate to a typical practical application is shown in Figure 12 (see [109] for further information).

The main aspect of this representation is that there is now an explicit variable representing the position in the underlying HMM. The position within the HMM is now distinct from the phone labeling that position, which is explicitly represented by a second set of variables. The tying issue is implemented by mapping from position to phone via a conditional probability distribution. In the example of Figure 12, positions 2 and 4 both map to /ih/. Different transition probabilities are obtained at each time point via an explicit transition variable, which is conditioned on the phone. Either there is a transition or not, and the probability depends on the phone (which means that different phones can have their own length distributions, and a given length distribution is active depending not on the position within the word, but only on the phone that is currently being considered). The position is a function of the previous position and the previous transition value: if the transition value was 0, then the position remains unchanged; otherwise it increments by one. These relationships can be straightforwardly and explicitly encoded as conditional probabilities, as shown in the figure.

In this representation, there is an "end-of-word" observation, which is assigned an arbitrary value of 1. This variable ensures that all variable assignments with non-zero probability must end on a transition out of the final position. This is done by conditioning this observation on the state variables, and setting its conditional probability distribution so that the probability of its observed value is 0 unless the position variable has the value of the last position. Similarly, its probability is 0 unless the transition value is 1. This ensures assignments in conformance with the classic HMM convention that all paths end in a transition out of the final emitting state.

In this model, the conditional probabilities need only be specified for the first occurrence of each kind of variable, and can then be shared by all subsequent occurrences of analogous variables. Thus, we have achieved the goal of making the conditional probabilities essentially "time-invariant". However, it is important to note that the probabilities do depend on the specific utterance being processed. In this model, the mapping from position to phone will change from utterance to utterance, as does the final position. Thus an implementation must support some mechanism for representing and reading in conditional probabilities on an utterance-by-utterance basis. This is analogous to reading in word graphs, lattices, or scripts on an utterance-by-utterance basis in a classic HMM system.

A final nuance of this model is that some of the conditional probability relationships are in fact deterministic, and not subject to parameter estimation. In fact, Figure 12 shows that some of the edges indicate truly random implementations (the smoothed zig-zag or wriggled edges), while the other edges indicate deterministic implementations (the straight edges). Specifically, the distribution controlling the position variable encodes a simple logical relationship: if the transition parent is 0, then Position_t = Position_{t-1}; otherwise, Position_t = Position_{t-1} + 1. Efficiently representing deterministic relationships, and exploiting them in inference, is important for an efficient implementation of a graphical modeling system [109].

To summarize, the model of Figure 12 uses the following conditional probabilities (a small sketch of the deterministic ones appears after the list):


1. position at frame 1: a deterministic function that is 1 with probability 1

2. position in all other frames: a deterministic function such that

P(position_t = position_{t-1} | position_{t-1}, transition_{t-1} = 0) = 1

P(position_t = position_{t-1} + 1 | position_{t-1}, transition_{t-1} = 1) = 1

3. phone: an utterance-specific deterministic mapping

4. transition: a dense table specifying P(transition | phone)

5. observation: a function, such as a Gaussian mixture, that specifies P(observation | phone)

6. end-of-utterance observation: an utterance-specific deterministic function such that

P(end-of-utterance = 1 | position, transition) = 0 whenever position ≠ final or transition ≠ 1
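A minimal sketch of the deterministic relationships in items 1, 2, and 6 follows; the variable names and the example alignment are illustrative only and are not GMTK syntax.

    def next_position(prev_position, prev_transition):
        # Item 2: copy the position when there is no transition, otherwise advance by one.
        return prev_position if prev_transition == 0 else prev_position + 1

    def p_end_of_word(value, position, transition, final_position):
        # Item 6: the end-of-word observation can take the value 1 only when we are
        # transitioning out of the final position.
        on = 1.0 if (position == final_position and transition == 1) else 0.0
        return on if value == 1 else 1.0 - on

    # Hypothetical 7-frame alignment of a 5-position word (item 1 fixes the first
    # position to 1).
    transitions = [1, 0, 1, 1, 1, 0, 1]
    positions = [1]
    for tr in transitions[:-1]:
        positions.append(next_position(positions[-1], tr))
    print(positions)                                                           # [1, 2, 2, 3, 4, 5, 5]
    print(p_end_of_word(1, positions[-1], transitions[-1], final_position=5))  # 1.0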

The graphical model in Figure 12 is useful in a variety of tasks:

• Parameter estimation when the exact sequence of words and therefore phones, including silences, is known.

• Finding the single best alignment of a sequence of frames to a known sequence of words and phones.

• Computing the probability of acoustic observations given a known word sequence, which is useful for rescoring the n-best hypotheses of another recognition system.

The main issue with this model is that, unless it is implemented using an expanded state space implicitly within the graph, a fully specified state sequence must be used. Such a sequence is available only if one knows where silence resides between words. For example, the state sequence corresponding to "hi there" will in general be different from that of "hi sil there," though in practice one does not know where the silences occur. One therefore has two choices: 1) resort to the implicit approach and use an expanded state space, or 2) use a more intricate graph which explicitly represents the possibility of silence between words and multiple lexical variants. In the next section, we pursue the latter of the two choices.

5.2 A more explicit structure for Decoding

A more explicit (and complicated) model structure can be used for decoding, and is described in this section. The structure given here was developed anew specifically for the JHU 2001 workshop, and became the basis for many of the recognition experiments that were undertaken during that time.

In general, the goal of the decoding process is to determine the likeliest sequence of words given an observed acoustic stream, and this can be done with the model structure shown in Figure 13. This model is similar in spirit to that of Figure 12, but there is an extra "layer" of both deterministic and random variables added on top. Again, the wriggled edges correspond to truly random implementations, and all straight edges depict determinism.

Since the goal is to discover a word sequence, there must be some representation of words, and this is obtained by associating an explicit "word" variable with each frame. The other addition is a "word transition" variable, which indicates when one word ends and another begins. The "word position" variable is analogous to the previous "position" variable, and indicates which phone of a word is under consideration – the first, second, etc.

The logic of this network is fairly straightforward. The combination of word and word position determines a phone. As before, both the observation and transition variables are conditioned on this. There is a word transition when there is a phone transition in the last phone of a word. Finally, the word variable retains its value across time if there is no word transition, and otherwise takes a new value according to a probability distribution that is conditioned on the previous word value. In other words, when the word transition variable at time t-1 is zero, the word variable at time t is a deterministic function of the word variable at time t-1. When the word transition variable at time t-1 is one, the word variable at time t is random, according to a bigram language model. For this reason, the edge between word variables is shown as both a straight edge and a wriggled edge: it is straight only when the previous word transition is zero. Such a construct can be encoded using switching dependencies within GMTK (see Section 6.1.7).
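A small sketch of this switching behavior for the word variable is given below; the bigram table is a made-up toy and the function names are illustrative, not GMTK constructs.

    import random

    # Toy bigram distribution P(w_t | w_{t-1}); values are invented.
    bigram = {"the": {"cat": 0.6, "hat": 0.4}, "cat": {"in": 0.7, "is": 0.3}}

    def next_word(prev_word, prev_word_transition, rng=random):
        if prev_word_transition == 0:
            return prev_word                       # deterministic copy (straight edge)
        dist = bigram[prev_word]                   # random draw (wriggled edge)
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    print(next_word("the", 0), next_word("the", 1))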

The only major constraint on decoding is that the interpretation should end at the end of a word, and this is enforced by adding an "end-of-utterance" observed variable and conditioning it on the word transition variable.


Figure 13: A graphical model for decoding.


Figure 14: A model for decoding with a trigram language model.

The probability distribution is set so that the observed end-of-utterance value is only possible (obtains non-zero probability) if the final word-transition value is 1. As in the simpler model of the previous section, all the logical relationships that were used to describe the network can be encoded as conditional probabilities.

This network is used for decoding by finding the likeliest values of the hidden variables. The decoded word sequence can be read off from the word variable in frames for which the word transition has the value 1. (Because a word may be repeated, as in "fun fun", one cannot simply look at the sequence of assignments to the word variable alone, since that would not indicate whether a duplicate word occurred or whether it was just a longer instance of a single word.)
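Reading off the word sequence from a decoded assignment can be sketched as follows; the frame values are illustrative.

    # Emit the word at every frame whose word-transition value is 1.
    word_values      = ["fun", "fun", "fun", "fun", "fun", "fun"]
    word_transitions = [0, 0, 1, 0, 0, 1]

    decoded = [w for w, tr in zip(word_values, word_transitions) if tr == 1]
    print(decoded)   # ['fun', 'fun'] -- the repetition is recovered correctly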

This decoding model uses a bigram language model of word sequences. That is, the probability contributed by the word variables w is factored as P(w) = \prod_t P(w_t | w_{t-1}). A trigram language model that conditions on the two previous words, P(w) = \prod_t P(w_t | w_{t-1}, w_{t-2}), is more common, and is encoded in the somewhat more complex model of Figure 14. In this model, there is another layer of variables, "last word," that represents the word w_{t-2}. When there is a word transition, the next word is chosen from a distribution that is conditioned on both the current and last word, and appropriate copy operations are done so that the last-word variable always has the correct value. When there is no word transition, the word is, as before, just a duplicate of the word at the previous frame. The trigram probability P(w_t | w_{t-1}, w_{t-2}) is shown in the figure by having two wriggled edges converging at the word variable, where the edge from the previous word is wriggled only when the word transition is one (again, that edge is shown both as a straight edge and as a wriggled edge).


Figure 15: Training with optional silence. The structure below the dotted line is substantially the same as the bigram decoding structure given in Figure 13.

5.3 A more explicit structure for training

Figure 12 illustrates a model that is suitable for training. The adjustable parameters are simply the output and transition probabilities, and when training is completed these are available for use in the same or other network structures. Due to its simplicity, this structure is relatively efficient for training (the graph has cliques only of size three), whenever the exact word and phone sequence is known.

Unfortunately, this is not always the case. For example, occurrences of silence may not be indicated in a transcription of training data, but it is known that silence does exist and therefore should be probabilistically modeled during training as well as during recognition. As another example, there might be multiple possible pronunciations for the words that are known, e.g. /T AH M EY T OW/ or /T AM M AA T OW/, and it is desirable not to choose one of those pronunciations at training time because phone transcriptions are not available. Again, such a feature can be added either implicitly or explicitly, and we pursue the latter.

Figure 15 illustrates a model structure that is able to consider all possible insertions of silence between words. This structure is essentially the same as the bigram decoding structure of Figure 13, with a small modification. There are two new variables, one denoting the position within the utterance, and the other denoting whether silence should be inserted or skipped at a word transition.

The "position within utterance" variable denotes the position (first, second, third, etc.) in the training script, where inter-word silences are explicitly numbered. For example, in the phrase "the cat in the hat," the numbering is:

sil(1) the(2) sil(3) cat(4) sil(5) in(6) sil(7) the(8) sil(9) hat(10) sil(11)

The probability distribution governing position-within-utterance can now be described. The basic idea is that if there is a transition at the end of a position denoting silence (e.g. position 3), then the position variable advances by one to the next word. On the other hand, if there is a transition at the end of a normal word position (e.g. position 4), then depending on the value of the skip-silence variable, the position either advances by one (and the silence is inserted) or by two (and the silence is skipped). The end-of-utterance variable is set up so that its assigned value is only possible if the position is either the last word or the last silence, and there is a word transition.
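The position-advance logic just described might be sketched as follows; the script and function names are illustrative only.

    # Toy training script with inter-word silences numbered explicitly (1-based).
    script = ["sil", "the", "sil", "cat", "sil", "in", "sil", "the", "sil", "hat", "sil"]

    def next_utterance_position(pos, word_transition, skip_sil):
        """Advance only on a word transition; after a real word, the optional
        silence is skipped when skip_sil == 1."""
        if word_transition == 0:
            return pos
        if script[pos - 1] == "sil":
            return pos + 1               # leaving a silence: move to the next word
        return pos + (2 if skip_sil == 1 else 1)

    print(next_utterance_position(4, 1, skip_sil=1))   # cat -> in (silence skipped)
    print(next_utterance_position(4, 1, skip_sil=0))   # cat -> sil (silence kept)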

Note that in this case the word variable is a deterministic function only of the position variable. During training, it is known what the first, second, etc. word is, so this is certainly possible to do (and is how GMTK implements it).

Training with multiple pronunciations or lexical variants for each word is more complicated. Figure 16 illustrates one possible graphical model that can accommodate multiple lexical variants, as well as optional silence. The main addition here is a variable that explicitly represents the lexical variant of the word being uttered. For example, "tomato" has two; a lexical-variant value of 0 represents /T AH M EY T OW/ and a value of 1 represents /T AM M AA T OW/.


Figure 16: Training with lexical variants.

The phone value is now determined by a combination of word, word position, and lexical variant. The lexical variant is selected according to an appropriate distribution when the word-transition value is 1. In the case that there is no transition, it is simply copied from the previous frame, which is why there is an edge between consecutive occurrences of the variable. When there is a transition, the lexical variant is chosen based on the new word (since there will typically be a different number of variants, with differing probabilities, for each word). This again depicts a switching dependency: when the word transition at time t-1 is zero, the lexical variant at time t is a copy (deterministic) of its value at the previous time frame. When the word transition at time t-1 is one, the lexical variant at time t is determined randomly based on the value of the word at time t (i.e., the new word). Although we do not present the decoding graph, a similar modification of the previous decoding networks allows for decoding with multiple lexical variants.

5.4 Rescoring

It is frequently the case that one has a reasonably good system that produces a set of word hypotheses, which one then wants to choose between on the basis of a more sophisticated model. This process is referred to as rescoring, and GMTK is ideally suited to rescoring existing hypotheses with a more sophisticated model.

The simplest kind of rescoring, and the one we consider in this section, is n-best rescoring, where there are n possible word sequences to choose between, and the sequences are simply enumerated; for example:

The cat in the hat.
The cat is the hat.
The cat in a hat.
The cat is a hat.

When the hypotheses are enumerated like this, rescoring can be done simply by computing the data probability according to each hypothesis with the basic model of Figure 12. This is a nice situation, because then exactly the same model structure can be used for both training and testing. There are two disadvantages, however: the first is that an outside system is required to produce the hypotheses, and the second is that a much more compact representation is often available in the form of a word lattice, as illustrated for "the cat in the hat" in Figure 17. While the first disadvantage is intrinsic to the rescoring paradigm, the second can be alleviated with a somewhat more sophisticated model structure, as discussed in the following section.


Figure 17: A word lattice. Each path from the leftmost point to the rightmost point represents a possible word sequence. The number of complete distinct paths can grow exponentially in the number of edges in the lattice, making it a far more compact representation than an enumeration of all possible word sequences.


Figure 18: A stochastic finite state automaton that represents several words and their pronunciation variants. The initial state is on the left and shaded; the final state is shaded on the right. Each path through the graph represents a valid pronunciation of a word.

5.5 Graphical Models and Stochastic Finite State Automata

Since lattices are a compact and useful representation in many applications, it is fortunate that a straightforward procedure allows them to be represented and manipulated in the graphical model framework [109]. To understand the exact analogy, it is necessary to define exactly what we mean by a lattice. This can be done with the following relatively standard definition. A lattice consists of:

1. a set of states;

2. a set of directed arcs connecting the states;

3. a subset of states identified as "initial states";

4. a subset of states identified as "final states";

5. a probability distribution over the arcs leaving each state.

The semantics of a lattice are that it represents a set of paths, each of which starts in an initial state, ends in a final state, and has a probability equal to the product of the transition probabilities encountered along the way. Figure 18 illustrates a stochastic finite state automaton.

As detailed in [109], there is a straightforward construction process by which a set of paths can be represented in a graphical model. The one restriction is that the paths must be of a given length; in real applications with concrete observation streams, this is always the case. As in previous examples, the key is to explicitly represent the state that is occupied at each time frame. Also in common with previous examples, there is a transition variable at each frame that in this case specifies which arc to follow out of the lattice state. Figure 19 illustrates a graphical model that can represent a generic SFSA.

In the model of Figure 19, the cardinality of the state variables is equal to the number of states in the underlying SFSA. The cardinality of the transition variable is equal to the maximum out-degree of any state. The conditional probability of the transition variable taking a particular value k, given that the state variable has value j, is equal to the probability of taking the kth arc out of state j. The end-of-path variable has an artificially assigned value of 1, which is only possible if the state variable has a value equal to a predecessor of a final state, and the transition variable has a value that denotes an arc leading to a final state. It can easily be shown [109] that conditional probabilities defined in this way lead to a model in which each assignment of values to the variables either corresponds to a valid path through the automaton, and gets the same probability, or else corresponds to an illegal path and gets 0 probability.
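The construction just described can be sketched as follows for a toy SFSA (not the one in Figure 18); the arc table, names, and probabilities are illustrative.

    # arcs[j] lists the outgoing arcs of state j as (next_state, probability) pairs.
    arcs = {0: [(1, 0.6), (2, 0.4)], 1: [(3, 1.0)], 2: [(3, 0.7), (1, 0.3)], 3: []}
    final_states = {3}

    def p_transition(k, state):
        """P(transition = k | state): probability of taking the k-th arc out of 'state'."""
        out = arcs[state]
        return out[k][1] if k < len(out) else 0.0

    def p_end_of_path(state, k):
        """The end-of-path observation (clamped to 1) is possible only if the chosen
        arc leads into a final state."""
        out = arcs[state]
        return 1.0 if k < len(out) and out[k][0] in final_states else 0.0

    print(p_transition(0, 2), p_end_of_path(2, 0))   # arc 2->3 has prob 0.7 and ends the path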

Note that each state (except the initial and final ones) in the SFSA of Figure 18 is labeled with an output symbol. In some cases, it is useful to have "null" or unlabeled states in the interior of the graph.


Figure 19: A graphical model structure for representing generic SFSAs with path length four. An assignment to the state and transition variables corresponds to a path through the automaton.

In such a case, multiple transitions may be taken within a single time frame, and the representation of Figure 19 is no longer adequate. Instead, multiple transition and state variables are required in each frame. This type of model is discussed in more detail in [109].

Note also that in the previous section, we saw a lattice in which the arcs, not the nodes, were labeled. In fact, a trivial conversion in which each labeled arc is broken into two unlabeled arcs with a labeled state sandwiched in between shows that the two representations are equivalent.

6 GMTK: The graphical models toolkit

As mentioned in earlier sections, with GMs one uses a graph to describe a statistical process, and thereby defines one of its most important attributes, namely conditional independence. Because GMs describe these properties visually, it is possible to rapidly specify a variety of models without much effort. Again, GMs subsume much of the statistical underpinnings of existing ASR techniques; no other known statistical abstraction appears to have this property. More importantly, the space of statistical algorithms representable with a GM is enormous, much larger than what has so far been explored for ASR. The time therefore seems ripe to start seriously examining such models.

Of course, this task is not possible without a (preferably freely available and open-source) toolkit with which one may maneuver through the model space easily and efficiently, and this section describes the first version of GMTK. GMTK can represent all of the models that have been described in previous sections, and was used throughout the 2001 JHU workshop.

GMTK is meant to complement rather than replace other publicly available packages; it has unique features, ones that are different from both standard ASR-HMM [108, 99, 57] and standard Bayesian network [16, 81] packages.

This section contains a detailed description of GMTK’s features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method that allows arbitrary sub-Gaussian parameter tying. Taken together, these features give GMTK a degree of expressiveness and functionality that significantly complements other publicly available packages. Full documentation of GMTK’s features and use can be found at the web location given in citation [8].

6.1 Toolkit Features

GMTK has a number of features that support a wide array of statistical models suitable for speech recognition and other time-series data. GMTK may be used to produce a complete ASR system for both small- and large-vocabulary domains. The graphs themselves may represent everything from N-gram language models down to Gaussian components, and the probabilistic inference mechanism supports first-pass decoding in these cases.

6.1.1 Explicit vs. Implicit Modeling

In general, and as discussed in Section 5, there are two representational extremes one may employ when using graphical models, and in particular GMTK, for an ASR system. On the one hand, a graph may explicitly represent all the underlying variables and control mechanisms (such as sequencing) that are required in an ASR system [109]. We call this approach an “explicit representation”, where variables can exist for such purposes as word identification, numerical word position, phone or phoneme identity, the occurrence of a phoneme transition, and so on. In this case, the structure of the graph explicitly represents the interesting hidden structure underlying an ASR system. On the other hand, one can instead place most or all of this control information into a single hidden Markov chain, and use a single integer state to encode all contextual information and control the allowable sequencing (Figure 11, top left). We call this approach an “implicit” representation.


frame: 0 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : nil using MDCPT("pi");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

frame: 1 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : state(-1) using MDCPT("transitions");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

chunk 1:1;

Figure 20: GMTKL specification of a basic HMM structure. The feature vector in this case is 39-dimensional, and there are 4000 hidden states. Frame 1 can be duplicated or "unrolled" to create an arbitrarily long network.


As an additional example of these two extremes, consider the word “yamaha” with pronunciation /y aa m aa hh aa/. The phoneme /aa/ occurs three times, each in a different context: first preceding an /m/, then preceding an /hh/, and finally preceding a word boundary. In an ASR system, it must somewhere be specified that the same phoneme /aa/ may be followed only by one of /m/, /hh/, or a word boundary depending on the context; /aa/, for example, may not be followed by a word boundary if it is the first /aa/ of the word. In the explicit GM approach, the graph and associated conditional probabilities unambiguously represent these constraints. In an implicit approach, all of the contextual information is encoded into an expanded single-variable hidden state space, where multiple HMM states correspond to the same phoneme /aa/ but in different contexts.

The explicit approach is useful when modeling the detailed and intricate structures of ASR. It is our belief, moreover, that such an approach will yield improved results when combined with a discriminative structure (see above and [6, 11]), because it directly exposes events such as word endings and phone transitions for use as switching parents (see Section 6.1.7). The implicit approach is further useful in tempering computational and/or memory requirements. In any case, GMTK supports both extremes and everything in between; a user of GMTK is therefore free to experiment with quite a diverse and intricate set of graphs. It is the task of the toolkit to derive an efficient inference procedure for each such system.

6.1.2 The GMTKL Specification Language

A standard DBN [29] is typically specified by listing a collection of variables along with a set of intra- and inter-frame dependencies which are used to unroll the network over time. GMTK generalizes this ability via dynamic GM templates. A template defines a collection of (speech) frames and a chunk specifier. Each frame declares an arbitrary set of random variables and includes attributes such as parents, type (discrete, continuous), parameters to use (e.g., discrete probability tables or Gaussian mixtures), and parameter sharing. At the end of a template is a chunk specifier (two integers, N:M) which divides the template into a prologue (the first N − 1 frames), a repeating chunk, and an epilogue (the last T − M frames, where T is the frame-length of the template). The middle chunk of frames is "unrolled" until the dynamic network is long enough for a specific utterance.
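A minimal sketch of this unrolling logic is given below (in Python, assuming a 0-based, inclusive chunk specifier; the function name and the error handling are our own, and GMTK's actual behavior may differ in its details).

def unroll(frames, chunk_start, chunk_end, num_target_frames):
    """Unroll a dynamic GM template (an illustrative sketch, not GMTK's code).

    frames: list of per-frame variable declarations (the template).
    chunk_start, chunk_end: the N:M chunk specifier (0-based, inclusive).
    num_target_frames: desired length of the unrolled network.
    """
    prologue = frames[:chunk_start]
    chunk = frames[chunk_start:chunk_end + 1]
    epilogue = frames[chunk_end + 1:]

    remaining = num_target_frames - len(prologue) - len(epilogue)
    if remaining < 0 or remaining % len(chunk) != 0:
        raise ValueError("utterance length incompatible with template")

    repeats = remaining // len(chunk)
    return prologue + chunk * repeats + epilogue

# The two-frame HMM template of Figure 20 (chunk 1:1), unrolled to 5 frames:
template = ["frame0-decls", "frame1-decls"]
print(unroll(template, chunk_start=1, chunk_end=1, num_target_frames=5))
# -> ['frame0-decls', 'frame1-decls', 'frame1-decls', 'frame1-decls', 'frame1-decls']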

GMTK uses a simple textual language (GMTKL) to define GM templates. Figure 20 shows the template of a basic HMM in GMTKL.


It consists of two frames, each with a hidden and an observed variable, and dependencies between successive hidden variables and between observed and hidden variables. For a given template, unrolling is valid only if all parent variables in the unrolled network are compatible with those in the template. A compatible variable has the same name, type, and cardinality. It is therefore possible to specify a template that cannot be unrolled and which would lead to GMTK reporting an error.

A template chunk may consist of several frames, where each frame contains a different set of variables. Using this feature, one can easily specify multi-rate GM networks where variables occur over time at rates which are fractionally but otherwise arbitrarily related to each other.

6.1.3 Inference

The current version of GMTK supports a number of operations for computing with arbitrary graph structures, the four main ones being:

1. Integrating over hidden variables to compute the observation probability: P(o) = Σ_h P(o, h)

2. Finding the likeliest hidden variable values: argmax_h P(o, h)

3. Sampling from the joint distribution P(o, h)

4. Parameter estimation given training data {o_k} via EM/GEM: argmax_θ Π_k P(o_k | θ)

A critical advantage of the graphical modeling framework derives from the fact that these algorithms work with any graph structure, and with a wide variety of conditional probability representations. GMTK uses the Frontier Algorithm, detailed in [109, 111], which converts arbitrary graphs into equivalent chain-structured ones, and then executes a forwards-backwards recursion. The frontier algorithm is standard junction-tree inference [60, 84, 25, 61] where the junction tree is formed via a constrained triangulation algorithm. The triangulation algorithm is equivalent to the variable elimination algorithm, but where the variable order is constrained such that the variables occur in topological order relative to the original directed model.

The chain structure is useful because it makes it easier to do beam pruning, to work with deterministic relationships between variables, and to implement logarithmic-space inference.
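For reference, the following is a minimal sketch of the kind of chain (forwards-backwards) recursion involved, written here for a plain discrete-observation HMM rather than for GMTK's frontier cliques; the per-frame scaling constants double as the normalizers from which log P(o) is accumulated.

import numpy as np

def forward_backward(pi, A, B, obs):
    """Forwards-backwards over a chain: returns log P(o) and state posteriors.

    pi: (S,) initial state distribution; A: (S, S) transition matrix;
    B: (S, V) emission matrix over discrete symbols; obs: list of symbol ids.
    A minimal sketch of chain inference, not GMTK's frontier algorithm.
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)

    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)    # P(state_t | o)
    return np.log(scale).sum(), gamma             # log P(o), posteriors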

6.1.4 Logarithmic Space Computation

In many speech applications, observation sequences can be thousands of frames long. When there are a dozen or so variables per frame (as in an articulatory network, see below), the resulting unrolled network might have tens of thousands of nodes, and cliques may have millions of possible values. A naive implementation of exact inference, which stores all clique values for all time, would require (an obviously prohibitive) gigabytes of storage.

To avoid this problem, GMTK implements a recently developed procedure [13, 110] that reduces memory requirements exponentially, from O(T) to O(log T) (this has also been called the Island algorithm). This reduction has a truly dramatic effect on memory usage, and can additionally be combined with GMTK's beam-pruning procedure for further memory savings. The key to this method is recursive divide-and-conquer. With k-way splits, the total memory usage is O(k log_k T), and the runtime is O(T log_k T). The constant of proportionality is related to the number of entries in each clique, and becomes smaller with pruning. For algorithmic details, the reader is referred to [110].
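The divide-and-conquer idea can be made concrete with the sketch below (our own illustration with binary, i.e. k = 2, splits over plain HMM messages, not GMTK's implementation): only the boundary messages of each recursion level are kept, and interior messages are recomputed on demand, so memory grows with the recursion depth, O(log T), at the cost of an extra logarithmic factor in runtime.

import numpy as np

def island_posteriors(pi, A, B, obs, min_len=8):
    """O(log T)-memory smoothing (Island algorithm sketch, binary splits).

    Only forward/backward messages at segment boundaries are retained on the
    recursion stack; interior messages are recomputed on demand. pi, A, B, and
    obs are as in the forward-backward sketch above.
    """
    T = len(obs)
    gamma = np.zeros((T, len(pi)))           # posteriors, filled in recursively

    def fwd_step(a, t):                       # one (normalized) forward update
        a = (a @ A) * B[:, obs[t]]
        return a / a.sum()

    def bwd_step(b, t):                       # one (normalized) backward update
        b = A @ (B[:, obs[t + 1]] * b)
        return b / b.sum()

    def solve(lo, hi, a_lo, b_hi):
        # a_lo: forward message at frame lo; b_hi: backward message at frame hi.
        if hi - lo + 1 <= min_len:            # base case: store the whole segment
            alphas = [a_lo]
            for t in range(lo + 1, hi + 1):
                alphas.append(fwd_step(alphas[-1], t))
            b = b_hi
            for t in range(hi, lo - 1, -1):
                g = alphas[t - lo] * b
                gamma[t] = g / g.sum()
                if t > lo:
                    b = bwd_step(b, t - 1)
            return
        mid = (lo + hi) // 2
        a_mid = a_lo                           # recompute forward up to the split
        for t in range(lo + 1, mid + 1):
            a_mid = fwd_step(a_mid, t)
        b_mid = b_hi                           # recompute backward down to the split
        for t in range(hi - 1, mid - 1, -1):
            b_mid = bwd_step(b_mid, t)
        solve(lo, mid, a_lo, b_mid)            # only boundary messages are retained
        solve(mid + 1, hi, fwd_step(a_mid, mid + 1), b_hi)

    a0 = pi * B[:, obs[0]]
    solve(0, T - 1, a0 / a0.sum(), np.ones(len(pi)))
    return gamma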

6.1.5 Generalized EM

GMTK supports both EM and generalized EM (GEM) training, and automatically determines which to use based on the parameter sharing currently in use. When there is no parameter tying, normal EM is employed. GMTK, however, has a flexible notion of parameter tying, down to the sub-Gaussian level; in such a case, the typical EM training algorithm does not lead to analytic parameter update equations. GMTK's GEM training is distinctive because it provides a provably convergent method for parameter estimation even when there is an arbitrary degree of tying, down to the level of Gaussian means, covariances, or factored covariance matrices (see Section 6.1.9).


Figure 21: When S = 1, A is C's parent; when S = 2, B is C's parent. S is called a switching parent, and A and B conditional parents.

6.1.6 Sampling

Drawing variable assignments according to the joint probability distribution is useful in a variety of areas, ranging from approximate inference to speech synthesis, and GMTK supports sampling from arbitrary structures. The sampling procedure is computationally inexpensive, and can thus be run many times to get a good distribution over hidden (discrete or continuous) variable values.

6.1.7 Switching Parents

GMTK supports another novel feature rarely found in GM toolkits, namely switching parent functionality (also called Bayesian multi-nets [11]). This was already used in Section 6.1.1. Normally, a variable has only one set of parents. GMTK, however, allows a variable's parents to change (or switch) conditioned on the current values of other parents. The parents that may change are called conditional parents, and the parents which control the switching are called switching parents. Figure 21 shows the case where variable S switches the parents of C between A and B, corresponding to the probability distribution P(C | A, B) = P(C | A, S = 1) P(S = 1) + P(C | B, S = 2) P(S = 2). This can significantly reduce the number of parameters required to represent a probability distribution: for example, P(C | A, S = 1) needs only a two-dimensional table, whereas P(C | A, B) requires a three-dimensional table. Switching functionality has found particular utility in representing certain language models, as experiments during the JHU 2001 workshop demonstrated.
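The semantics can be illustrated with a small sketch (toy binary tables and names of our own choosing; GMTK's actual implementation maps parent values to distributions through decision trees): the switching parent S selects which conditional table is consulted, and marginalizing over S recovers the mixture given above.

import numpy as np

# All variables are binary; S chooses whether C is conditioned on A or on B.
P_S = np.array([0.3, 0.7])                   # P(S=1), P(S=2)
P_C_given_A = np.array([[0.9, 0.1],          # row a, column c: P(C=c | A=a, S=1)
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.6, 0.4],          # row b, column c: P(C=c | B=b, S=2)
                        [0.5, 0.5]])

def p_c(c, a, b, s=None):
    """P(C=c | A=a, B=b, S=s); with s=None, S is marginalized out."""
    if s == 1:
        return P_C_given_A[a, c]
    if s == 2:
        return P_C_given_B[b, c]
    return P_S[0] * P_C_given_A[a, c] + P_S[1] * P_C_given_B[b, c]

# Two 2x2 conditional tables plus P(S) replace one 2x2x2 table P(C | A, B).
print(p_c(c=1, a=0, b=1))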

6.1.8 Discrete Conditional Probability Distributions

GMTK allows the dependency between discrete variables to be specified in one of three ways. First, they may be deterministically related using flexible n-ary decision trees. This provides a sparse and memory-efficient representation of such dependencies. Alternatively, fully random relationships may be specified using dense conditional probability tables (CPTs). In this case, if a variable of cardinality N has M parents of the same cardinality, the table has size N^(M+1). Since this can get large, GMTK supports a third, sparse method to specify random dependencies. This method combines sparse decision trees with sparse CPTs so that zeros in a CPT simply do not exist (i.e., are not stored). The method also allows flexible tying of discrete distributions from different portions of a CPT.

6.1.9 Graphical Continuous Conditional Distributions

GMTK supports a variety of continuous observation densities for use as acoustic models. Continuous observation variables for each frame are declared as vectors in GMTKL, and each observation vector variable can have an arbitrary number of conditional and switching parents. The current values of the parents jointly determine the distribution used for the observation vector. The mapping from parent values to child distribution is specified using a decision tree, allowing a sparse representation of this mapping. A vector observation variable spans a region of the feature vector at the current time. GMTK thereby supports multi-stream speech recognition, where each stream may have its own set of observation distributions and its own sets of discrete parents.

The observation distributions themselves are mixture models. GMTK uses a splitting and vanishing algorithm during training to learn the number of mixture components. Two thresholds are defined, a mixture-coefficient vanishing ratio (mcvr) and a mixture-coefficient splitting ratio (mcsr). Under a K-component mixture with component probabilities p_k, if p_k < 1/(K × mcvr) then the kth component will vanish, and if p_k > mcsr/K that component will split. GMTK also supports forced splitting (or vanishing) of the N most (or least) probable components at each training iteration. Sharing portions of a Gaussian, such as means and covariances, can be specified either by hand via parameter files, or via a split (e.g., the split components may share an original covariance).
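A sketch of the split/vanish decision rule is shown below; the default threshold values and the function name are purely illustrative and are not GMTK's defaults.

import numpy as np

def split_vanish(weights, mcvr=50.0, mcsr=1.0):
    """Decide which mixture components to split or remove (a sketch only).

    weights: current mixture coefficients p_k (they sum to one).
    Returns (split_indices, vanish_indices).
    """
    K = len(weights)
    vanish = [k for k, p in enumerate(weights) if p < 1.0 / (K * mcvr)]
    split = [k for k, p in enumerate(weights) if p > mcsr / K]
    return split, vanish

# A 4-component mixture: component 3 is nearly dead, component 0 dominates.
weights = np.array([0.80, 0.12, 0.078, 0.002])
print(split_vanish(weights))        # -> ([0], [3])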

31

Page 32: Discriminatively Structured Graphical Models for Speech ...ssli.ee.washington.edu/ssli/people/karim/papers/GMRO-final-rpt.pdf · Discriminatively Structured Graphical Models for Speech

Each component of a mixture is a general conditional Gaussian. In particular, the cth component probability is p(x | z_c, c) = N(x | B_c z_c + f_c(z_c) + µ_c, D_c), where x is the current observation vector, z_c is a c-conditioned vector of continuous observation variables from any observation stream and from the past, present, or future, B_c is an arbitrary sparse matrix, f_c(z_c) is a multi-logistic non-linear regressor, µ_c is a constant mean residual, and D_c is a diagonal covariance matrix. Any of the above components may be tied across multiple distributions, and trained using the GEM algorithm.
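As an illustration of this density, the sketch below evaluates the log probability of a single component; the regressor argument f is a stand-in for the multi-logistic non-linearity mentioned above, and all names here are our own rather than GMTK's.

import numpy as np

def cond_gaussian_logpdf(x, z, B, mu, D, f=None):
    """log N(x | B z + f(z) + mu, D) with diagonal covariance D (a sketch)."""
    mean = B @ z + mu + (f(z) if f is not None else 0.0)
    resid = x - mean
    return -0.5 * np.sum(resid ** 2 / D + np.log(2.0 * np.pi * D))

# Toy example: 3-dimensional x conditioned on a 2-dimensional z.
rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=2)
B = np.array([[0.5, 0.0], [0.0, 0.2], [0.1, 0.1]])   # sparse regression matrix
mu, D = np.zeros(3), np.ones(3)                       # mean residual, diagonal cov
print(cond_gaussian_logpdf(x, z, B, mu, D))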

GMTK treats Gaussians as directed graphical models [7], and can thereby represent all possible Gaussian factorization orderings, and all subsets of parents in any of these factorizations. Under this framework, GMTK supports diagonal, full, banded, and semi-tied factored sparse inverse covariance matrices [12]. GMTK can also represent arbitrary switching dependencies between individual elements of successive observation vectors. GMTK thus supports both linear and non-linear buried Markov models [10]. All in all, GMTK supports an extremely rich set of observation distributions.

7 The EAR Measure and Discriminative Structure Learning Heuristics

This section summarizes different approaches to modifying the structure of an HMM in order to improve classification performance. The underlying goal in this endeavor is to augment the structure of an HMM in a structurally discriminative fashion. The space of possible models that is spanned by the optimization procedure is also described. In particular, the graph structure for an HMM is treated such that the vector observation variables are expanded into their individual element variables, as in a BMM [10]. These observation vectors, and the element variables therein, are augmented with discriminative cross-observation edges, leading to fewer independence statements made by the model in an attempt to improve structural discriminability. This is shown in Figure 22. This section also introduces the EAR measure [6], but provides a new and more precise derivation of exactly what assumptions are needed for it to be obtained. In doing so, a new optimal criterion function for structural discriminability (fast-forward to Equation 4) is derived. This derivation could lead to additional novel heuristics to achieve structural discriminability.

7.1 Basic Set-Up

The basic problem that we consider may be summarized thus: let Q denote a hidden class variable, taking values q; let X denote a (vector-valued) variable comprising a set of acoustic features observed at a specific frame; finally, let W denote a set of prior observations that may also be useful for determining to which class q the given frame belongs. Typically the dimension of W is too large for all of these features to be incorporated into the predictive model for the current time frame. Thus we face a model selection problem: find a model M_Z, which incorporates a subset of features Z (Z ⊆ W), but which gives good predictive performance.

We may formalize this as follows: our goal is to choose a model M_Z from a class of models M, such that the resulting fitted distribution p_Z(Q | X, W) maximizes

E_{p(Q,X,W)} log p_Z(Q | X, W).   (3)

Since

KL(p(Q | X, W) || p_Z(Q | X, W)) = E_{p(Q,X,W)} log [ p(Q | X, W) / p_Z(Q | X, W) ],

maximizing (3) is equivalent to minimizing the KL divergence between the true distribution p(Q | X, W) and the fitted distribution p_Z(Q | X, W).

7.2 Selecting the optimal number of parents for an acoustic feature X

Here we consider the simplest version of the model selection problem. For a given Z ⊂ W, we define the following model:

M_Z = { p* | p*(x, q, w) = p*(q, w) p*(x | q, z) }.

This corresponds to a graphical model in which Q, W form a clique, and the parents of X are Q and Z. M_Z is simply the set of model distributions in which X ⊥⊥ W \ Z | Q, Z holds.


Figure 22: Basic set-up: we wish to find a set Z ⊂ W of fixed dimension of parents for X which will lead to an optimal classification model for Q given X and W. Q is typically a hidden variable at one time step of an HMM, and X are the feature vectors at the current time point. W is the set of all possible additional parents of X that could be chosen, and W might consist either of collections of X vectors from an earlier time, or could consist of entirely different features that are not ordinarily used for an X at any time.

Note that while X ⊥⊥ W \ Z | Q, Z would hold for a particular model that has been selected, it is not necessarily the case that X ⊥⊥ W \ Z | Q, Z is correct according to the true generative model p(x, q, w). As mentioned in Section 4, this is not a concern, as the goal here is only to obtain generative models that discriminate well.

We now define the model class: M_c = { M_Z where |Z| = c }.

This is simply the set of graphical models in which X has exactly c parents in addition to Q, and Q, W forms a clique. For a given model M_Z we will let p_Z denote the fitted distribution, given data, under the model M_Z. Since Q, W forms a clique, the fitted distribution and the true distribution are the same: p_Z(Q, W) = p(Q, W).² Similarly, p_Z(X | Q, Z) = p(X | Q, Z) since we allow for the variables X to form a clique. (If we are fitting a parametric rather than a non-parametric sub-model of M_Z then these last two equations will not hold; we return to this point below.) Note that for models within M_Z we have that

p_Z(X, W) = Σ_q p_Z(X, q, W) = Σ_q p(q, W) p(X | q, Z).

² When integrating with respect to the true distribution p, and the model is non-parametric, it is only conditional independence statements which distinguish the model and the true distribution. For a clique, there are no independence statements.


Now,

E_p log p_Z(Q | X, W)
  = E_p log p_Z(Q, X, W) − E_p log p_Z(X, W)
  = E_p log p_Z(X | Q, W) + E_p log p_Z(Q, W) − E_p log p_Z(X, W)
  (∗) = E_p log p_Z(X | Q, Z) + E_p log p_Z(Q, W) − E_p log p_Z(X, W)
  = E_p log p(X | Q, Z) + E_p log p(Q, W) − E_p log p_Z(X, W)
  = I(X; Z | Q) + E_p log p(X | Q) + E_p log p(Q, W) + KL(p(X, W) || p_Z(X, W)) − E_p log p(X, W),

where the step marked (∗) follows from the conditional independence assumption made in the model M_Z. Disregarding terms which do not depend on Z, we then see that the optimal Z is the one which maximizes:

I(X; Z | Q) + KL(p(X, W) || p_Z(X, W)).   (4)

Thus, in words, we wish to find the set Z which maximizes the conditional mutual information between X and Z given Q, but which at the same time maximizes the KL divergence between p(X, W) and p_Z(X, W).

Note that the only assumption that we have made so far is that

p_Z(X | Q, Z) = p(X | Q, Z)  and  p_Z(Q, W) = p(Q, W).

These assumptions will hold true if we are fitting a non-parametric model, e.g., as would be the case if we were fitting discrete Bayes nets. However, note that if p_Z(Q, W) ≠ p(Q, W), but all models M_Z agree on the distribution p(Q, W) (i.e., they all use the same distribution over the variables Q and W), then the expression obtained in equation (4) would still select the optimal model under criterion (3), since in this case the term E_p log p_Z(Q, W) does not depend on Z. More formally, criterion (4) is correct if we are considering selecting among the models

M*_Z = { p* | p*(x, q, w) = p_0(q, w) p*(x | q, z) },

where p_0(q, w) is a fixed distribution. This would be the case when learning structure in an HMM in which the model for p(Q, W) was already fixed.

7.3 The EAR criterion

Expanding the KL-divergence term in (4) which depends on Z, we obtain the following:

−E_p log p_Z(X, W)
  = −E_p log Σ_q p_Z(q, X, W)
  = −E_p log Σ_q p_Z(X | q, W) p_Z(q | W) − E_p log p_Z(W)
  = −E_p log Σ_q p_Z(X | q, Z) p_Z(q | W) − E_p log p_Z(W)
  = −E_p log Σ_q p(X | q, Z) p(q | W) − E_p log p(W),

where in the last line we use the facts that p_Z(X | Q, Z) = p(X | Q, Z), p_Z(Q | W) = p(Q | W), and p_Z(W) = p(W).


If Q ⊥⊥ (W \ Z) | Z in the true distribution p, so that p(q | W) = p(q | Z), then the sum in the last expression becomes:

−E_p log Σ_q p(X | q, Z) p(q | Z) = −E_p log p(X | Z) = −I(X; Z) − E_p log p(X).

Thus if Q ⊥⊥ (W \ Z) | Z in the true distribution, then maximizing (4) is equivalent to maximizing:

EAR(Z) = I(X; Z | Q) − I(X; Z).   (5)

This is the Explaining Away Residual (EAR) criterion proposed by Bilmes (1998). Since

I(X; Z | Q) − I(X; Z) = I(X; Q | Z) − I(X; Q)

and the last term on the right-hand side does not depend on Z, a criterion which is equivalent to the EAR criterion results from selecting the set Z which maximizes:

EAR*(Z) = I(X; Q | Z).

If Q is not independent of (W \ Z) given Z in the true distribution p, then we are no longer guaranteed that the set Z which optimizes the EAR criterion (5) will optimize our objective (3). Though this independence is unlikely to hold exactly in the true distribution, it may hold approximately in contexts where the features W relate to noise, which is independent of the state Q.

One obvious approach to assessing whether or not Q ⊥⊥ (W \ Z) | Z would be to calculate I(Q; W \ Z | Z) for different choices of Z. In particular, it would seem to be of concern if I(Q; W \ Z_EAR | Z_EAR) >> 0, where Z_EAR is the set which optimizes the EAR criterion.

Note that the EAR criterion (5) will also be equivalent to (3) in the case where X ⊥⊥ Q | Z in the true distribution p, since in that case

E_p log Σ_q p(X | q, Z) p(q | W) = E_p log p(X | Z).

However, this independence seems rather unlikely to hold in practice.

7.4 Class-specific EAR criterion

The EAR criterion may be adapted to select class-specific sets of parents ('switching parents') as follows:

EAR_q(Z) = I(X; Z | Q = q) − I(X; Z).

Class-specific sets of parents allow the corresponding statistical model to encode 'context-specific independence' (CSI) constraints, of the form:

X ⊥⊥ Y | Z = z.

7.5 Optimizing the EAR criterion: heuristic search

In principle, optimizing the EAR criterion 'simply' requires us to calculate I(X; Z | Q) and I(X; Z). In practice, when Z represents a set of covariates, the calculation of these quantities for a large speech corpus is computationally intensive. Since each feature vector typically contains between 20 and 40 components, and there are thought to be long-range dependencies at time lags of up to 150 ms (at 10 ms per time-frame), the set W of potential parents for a given X variable may contain thousands of candidate covariates. Under these circumstances it is infeasible to optimize the EAR criterion directly. This motivates the use of heuristic search procedures.

A greedy algorithm provides a simple heuristic search procedure for finding a set of size k (a sketch in code is given after the steps below):

(a) Set Z = ∅.

(b) Find the variable U ∈ W \ Z for which EAR(Z ∪ {U}) is maximized. Add U to Z.

(c) Repeat step (b) until Z has dimension k.
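A minimal sketch of this greedy loop follows; the ear argument stands for any routine that scores a candidate set Z (for example, an estimate of I(X;Z|Q) − I(X;Z)), and nothing here is taken from the actual workshop code.

def greedy_parent_selection(candidates, k, ear):
    """Greedily grow Z one variable at a time until |Z| = k (a sketch)."""
    Z = []
    remaining = set(candidates)
    while len(Z) < k and remaining:
        # Step (b): pick the candidate whose addition maximizes EAR(Z + {U}).
        best = max(remaining, key=lambda u: ear(Z + [u]))
        Z.append(best)
        remaining.remove(best)
    return Z

# Example with a stand-in scoring function over integer-labelled candidates.
fake_scores = {0: 0.05, 1: 0.20, 2: 0.01, 3: 0.12}
ear = lambda Z: sum(fake_scores[u] for u in Z)     # purely illustrative
print(greedy_parent_selection(candidates=range(4), k=2, ear=ear))  # -> [1, 3]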


7.6 Approximations to the EAR measure

The approach just described still suffers from the disadvantage that calculation of the EAR measure requires calculation of joint mutual information between vectors of variables, which in turn requires multivariate joint densities to be computed. This was infeasible for the speech recognition tasks that were considered during the Johns Hopkins workshop, given the time and resources that were available. Consequently we investigated simple approximations to the EAR measure that only required the calculation of information between scalars.

7.6.1 Scalar approximation 1

The first variable Z_1 was selected by evaluating the EAR criterion directly, which for a single variable does not require evaluating information between vectors of variables.

The second variable Z_2 was found by taking the variable with the highest EAR criterion among the remaining variables (W \ {Z_1}) which at the same time did not appear 'redundant', in that it satisfied at least one of the following inequalities:

I(X; Z_2 | Q) > I(X; Z_1 | Q)   or   I(X; Z_2 | Q) > I(Z_1; Z_2 | Q).   (6)

This is depicted in Figure 23, where edge-thickness roughly corresponds to mutual-information value.

Figure 23: Heuristic: conditions under which a second variable Z_2 is not considered redundant with the first variable Z_1 added. Edge thickness corresponds roughly with mutual-information magnitude.

The motivation behind verifying these inequalities was as follows: if

Z_2 ⊥⊥ X | Z_1, Q

then adding Z_2, in addition to Q and Z_1, as a parent of X will not change the resulting model. This conditional independence implies the reverse inequalities via the information processing inequality:

I(X; Z_2 | Q) ≤ I(X; Z_1 | Q)   and   I(X; Z_2 | Q) ≤ I(Z_1; Z_2 | Q).

Consequently, if at least one of the inequalities in (6) holds, then the conditional independence cannot hold.

Similarly, if a third variable is required, then we selected the variable with the highest EAR measure which satisfies at least one of the inequalities

I(X; Z_3 | Q) > I(X; Z_i | Q)   or   I(X; Z_3 | Q) > I(Z_i; Z_3 | Q)

for each of i = 1 and i = 2. However, in practice we found that this rule was not helpful in selecting additional parents. We also considered schemes in which these redundancy tests were traded off. However, these schemes did not lead to notable improvement in recognition. Note, however, that these redundancy criteria are not discriminative, in that they ask only for redundancy with respect to conditional mutual information, and not the EAR measure (and certainly not Equation 4). Therefore, it is not entirely surprising that these conditions showed little effect.

7.6.2 Scalar approximation 2

The second approximation method that we used was to rank variables based on the EAR measure, but simply to reject those for which the marginal information was too large; i.e., we eliminated from consideration those covariates Z* for which I(X; Z*) was greater than a threshold. Typically, this threshold was selected by calculating this quantity for all covariates Z* under consideration and then choosing an appropriate percentile.


7.6.3 Scalar approximation 3

The third approximation was the simplest of them all: choose the top n variables ranked by the EAR measure. This heuristic proved to perform about as well as the ones above, and was used for all word error experiments described in Sections 9 and 10.
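The following sketch combines approximations 2 and 3 (the function and argument names are our own): candidates may first be filtered by a percentile threshold on their marginal mutual information, and the top n remaining candidates by scalar EAR score are then kept.

import numpy as np

def select_parents(ear_scores, marginal_mi, n, mi_percentile=None):
    """Scalar approximations to EAR-based parent selection (a sketch).

    ear_scores[j]  : I(X; Z_j | Q) - I(X; Z_j) for each candidate parent j.
    marginal_mi[j] : I(X; Z_j).
    n              : number of parents to keep (approximation 3).
    mi_percentile  : if given, first discard candidates whose marginal MI
                     exceeds this percentile (approximation 2).
    """
    ear_scores = np.asarray(ear_scores, dtype=float)
    candidates = np.arange(len(ear_scores))
    if mi_percentile is not None:
        cutoff = np.percentile(marginal_mi, mi_percentile)
        candidates = candidates[np.asarray(marginal_mi)[candidates] <= cutoff]
    order = candidates[np.argsort(-ear_scores[candidates])]
    return order[:n].tolist()

# Four candidates: candidate 2 has the best EAR but a very large marginal MI.
ear = [0.01, 0.04, 0.09, 0.03]
mi = [0.20, 0.30, 0.95, 0.25]
print(select_parents(ear, mi, n=2))                      # -> [2, 1]
print(select_parents(ear, mi, n=2, mi_percentile=75))    # -> [1, 3]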

7.7 Conclusion

The main findings from the exploration of methods for discriminative learning of structure were as follows:

• The EAR method performed well in selecting a single additional parent for each feature;

• Adding class-specific parents via the class-specific EAR measure did not lead to significant improvements in performance; this stands in contrast, however, to previous work [6, 11] where a benefit was obtained with class-specific parents, using the EAR measure along with a randomized edge selection algorithm;

• To select additional parents it is necessary to evaluate mutual information between vectors of variables; it is not possible to judge the relevance of a variable simply by making comparisons between scalars.

8 Visualization of Mutual Information and the EAR measure

As described in Section 7, even the simplest scalar version of the EAR measure requires the computation of mutual information and conditional mutual information on a wide collection of pairs of scalar elements of speech feature vectors. Before we present new word error results using this measure, it is elucidative in its own right to visualize such mutual information and the EAR measure on a set of quite different types of speech feature vectors. As will be seen, the degree to which these visualizations show large-magnitude EAR-measure values should correspond roughly to the degree to which discriminative HMM structure augmentation should improve word error performance in a speech recognition system.

First, it must be noted that even the simple pairwise mutual information I(X; Z | Q), where X and Z are scalars, is obtained over a three-dimensional grid. X really means X_{t,i}, where t is the current time position and i is the ith element of the feature vector at time t, and Z really means Z_{t+τ,j}, where τ is the time lag between the current time position and the candidate Z variable, and j is the position in the vector at time t + τ. This is shown in Figure 24. In general, we consider the variables at time t the child variables (i.e., the child at time t and position i), and the variables at time t + τ the possible parent variables at position j. We call τ the lag.
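As a rough illustration of how such a grid could be computed, the sketch below uses simple histogram plug-in estimates of I(X;Z) and I(X;Z|Q) over scalar pairs; the binning, function names, and data layout are our own choices rather than those of the workshop tools, and estimates of this kind on real corpora require considerably more care.

import numpy as np

def mi(x, z, bins=16):
    """Histogram estimate of I(X;Z) for two scalar feature sequences."""
    pxz, _, _ = np.histogram2d(x, z, bins=bins)
    pxz /= pxz.sum()
    px, pz = pxz.sum(axis=1, keepdims=True), pxz.sum(axis=0, keepdims=True)
    nz = pxz > 0
    return float(np.sum(pxz[nz] * np.log(pxz[nz] / (px @ pz)[nz])))

def ear_grid(feats, labels, lags, bins=16):
    """EAR-style grid over (child i, parent j, lag tau).

    feats: (T, D) array of feature vectors; labels: (T,) class labels q.
    Returns grid[i, j, l] = I(X_ti; Z_{t-lag,j} | Q) - I(X_ti; Z_{t-lag,j}),
    a rough sketch of the quantities shown in the jet and d-link plots.
    """
    T, D = feats.shape
    grid = np.zeros((D, D, len(lags)))
    for l, lag in enumerate(lags):
        x, z, q = feats[lag:], feats[:T - lag], labels[lag:]
        for i in range(D):
            for j in range(D):
                cond = 0.0
                for c in np.unique(q):          # I(X;Z|Q) = sum_q P(q) I(X;Z|Q=q)
                    sel = q == c
                    cond += sel.mean() * mi(x[sel, i], z[sel, j], bins)
                grid[i, j, l] = cond - mi(x[:, i], z[:, j], bins)
    return grid

# Toy usage with random data (2 classes, 3 features, lags of 1 and 3 frames).
rng = np.random.default_rng(0)
feats, labels = rng.normal(size=(500, 3)), rng.integers(0, 2, size=500)
print(ear_grid(feats, labels, lags=[1, 3]).shape)   # -> (3, 3, 2)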

Figure 24: EAR measure computation: the set of possible pairs of variables (child X_{t,i} and lagged parent Z_{t+τ,j}) on which pairwise mutual information (both conditional and unconditional) needs to be computed. This can often be thousands of pairs of variables.

It can therefore be seen that to visualize mutual information, conditional mutual information, or the EAR measure, it must be possible to visualize volumetric data, plotting the quantities as functions of i, j, and τ. One way to do this would be to represent slices through this volume at various fixed i, j, or τ. Another way might be to average across some dimension and project down onto a diagonal plane. This was used in [6]. For the purposes of the workshop, we found it simplest to average across a dimension and then project down onto one of the three planar axes, as depicted in Figure 25. This leads to three 2-dimensional color plots, as described in the next three paragraphs.


First, the j : τ plot is the average MI as a function of parent and time lag. This plot therefore shows the relative value of a parent overall at a given time lag τ and parent position j. If the value for a particular j, τ is large, then that parent at that time lag will be beneficial overall.

Next, the i : τ plot is the average MI as a function of child and time lag. This plot shows the degree to which a child variable at position i at time t is "fed" useful information, on average, by all variables at time t + τ. This plot therefore shows the relative benefit of each time lag for each child.

Finally, the i : j plot corresponds to the average MI as a function of parent and child. This plot therefore provides information about the most useful parent for each child overall, irrespective of time. If the MI value of a particular parent is low for a given child, then this parent will never be useful for that child. On the other hand, if the MI value is large, there will be many instances in the set of candidate variables where useful information about the child may be found. This view, of course, does not indicate the degree to which multiple parents might be redundant with one another, as only pairwise scalar mutual information is calculated.

Below, these three plots will be shown in a row, with the first (j : τ) plot on the left, the second (i : τ) plot in the middle, and the third (i : j) plot on the right.

Figure 25: EAR measure projections: the volumetric data over (i, j, τ) is projected onto the parent-lag (j : τ), child-lag (i : τ), and parent-child (i : j) planes.

Note that the above descriptions were in terms of mutual information (MI), i.e., I(X; Z). The three 2-dimensional plots could equally well describe conditional mutual information I(X; Z | Q), or the EAR measure I(X; Z | Q) − I(X; Z). In general, we have termed these plots "jet plots" for obvious reasons.

We produced jet plots for differing feature sets (MFCCs, IBM LDA-MLLT features, and novel AM/FM-based features), differing corpora (Aurora 2.0, an IBM audio/video corpus, and the DARPA Speech in Noisy Environments One (SPINE-1) corpus), and differing hidden conditions (Q corresponding to phones, sub-phones, whole words, and general HMM states). Having these comparisons in a single document allows the examination of the relative differences between different corpora, conditions, etc. The various corpora and features are described below.

8.1 MI/EAR: Aurora 2.0 MFCCs

The first row of plots is given in Figure 26.³ These plots show the conditional mutual information I(X; Z | Q) for the Aurora 2.0 corpus [56], where Q corresponds to different phones, as defined in the standard Aurora distribution. Aurora 2.0 is further described in Section 10.1. The plots are for MFCCs and their first and second derivatives. Specifically, the first 12 features correspond to cepstral coefficients c1 through c12, the next feature is c0, which is followed by log-energy. The deltas for these 14 features (in the same order) come next, followed by the double-deltas.

³ If these plots are fuzzy and you are reading them in paper form, we suggest that you obtain an electronic PDF version of the document, with which it is possible to zoom in quite closely, as the plots are included within the document in reasonably high resolution.


Figure 26: Aurora Phone. Average I(X;Z|Q) for Aurora phone MFCCs: parent feature position vs. time lag (left), child feature position vs. time lag (middle), and parent vs. child feature position (right).

Figure 27: Discriminative Aurora Phone. Average I(X;Z|Q) − I(X;Z) for Aurora phone MFCCs, in the same three projections.

The plots show a number of things. First, the left-most plot shows that the parents with the most mutual information in general come from c0 and log-energy, and their deltas (and to some extent double-deltas). Similarly, from the middle plot, the children which receive the most information from any parent at a given time are also c0 and log-energy (and deltas). This is confirmed by the third plot, which shows that, in absolute terms, most of the information is between c0 and log-energy and their time-lagged variants. There is also information between c0 (or log-energy) and its delta and double-delta versions, more at least than for the other features. Furthermore, most of the information is closer to, rather than farther away from, the base time position t. It can also be seen from the diagonal lines in the right-most plot that features in general tend to share MI with lagged versions of themselves and with their derivatives. There is in fact information between different features, but it is at a lower magnitude than that involving c0 and log-energy, and is therefore difficult to see from this plot alone using this colormap.

The EAR-measure (discriminative) version of these plots is shown in Figure 27. The first obvious thing to note is that discriminative MI is quite different from non-discriminative MI, suggesting that the choice of discriminative MI might have a significant impact on structure learning. In the left two EAR plots, for example, the same features that were valuable in the non-discriminative plots remain valuable, but only at much greater time lags. In fact, the closer time lags of these features seem to indicate that these would be some of the least discriminative edges to use. This trait extends to features other than c0 and log-energy: specifically, edges between neighboring features do not tend to have an advantage over their more distant counterparts. Moreover, looking at the right-most EAR plot, the lack of a clear diagonal indicates that edges between different feature positions appear to have more discriminative benefit than those between corresponding features.

The next set of plots is shown in Figure 28 and Figure 29. The plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone random variable to a sub-phone HMM-state random variable) would not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note also that the overall range of the EAR measure appears to be lower in the phone-state plots than in the phone plots. This might indicate that, conditioned on the phone state, there would be less utility in an augmented structure.

The next set of plots is shown in Figure 30 and Figure 31. These plots show the case when Q corresponds to an actual word (one of the words in the Aurora 2.0 vocabulary of size 11) rather than a phone or phone-state.


Figure 28: Aurora Phone State. Average I(X;Z|Q) for Aurora phone-state MFCCs, in the same three projections.

Figure 29: Discriminative Aurora Phone State. Average I(X;Z|Q) − I(X;Z) for Aurora phone-state MFCCs.

Again, the plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone or phone-state random variable to a whole-word random variable) might not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note, however, that the overall range of EAR values is in this case larger than in the two previous cases. This would appear to be encouraging.

Lastly, for the Aurora 2.0 MFCC plots, Figure 32 and Figure 33 show the case where Q corresponds to whole-word states. That is, the HMM models in this case use entire words, but the conditioning set of the random variable Q corresponds to all of the possible states within each of the words. Once again, the patterns are the same, and in this case the EAR range seems to be the lowest of the set so far.

Figure 30: Aurora Word. Average I(X;Z|Q) for Aurora word MFCCs, in the same three projections.


Figure 31: Discriminative Aurora Word. Average I(X;Z|Q) − I(X;Z) for Aurora word MFCCs.

Figure 32: Aurora Word State. Average I(X;Z|Q) for Aurora whole-word-state MFCCs.

8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features

The next set of plots corresponds to a heavily pre-processed set of feature vectors that were created at IBM Research. These features were computed on the IBM parallel audio/video corpus [49], although the plots shown here only include the audio portion of the corpus. While we had originally wanted to compute cross-stream dependencies between the audio and visual portions of the corpus, the six-week time limitation prevented us from doing so.

The audio-stream features were obtained by training a linear discriminant analysis (LDA) transform on 9 frames of cepstral coefficients, followed by a maximum likelihood linear transform (MLLT) [48]. The video-stream features in this feature set were obtained by an LDA-MLLT transform of the pixels in a region of interest around the mouth, as described in [72].

The LDA-MLLT jet plots also demonstrated a striking difference between the EAR and non-discriminative MI plots. The first thing to note is that the magnitude of the MI plots is on average less than any seen so far with the Aurora MFCC plots.

Figure 33: Discriminative Aurora Word State. Average I(X;Z|Q) − I(X;Z) for Aurora whole-word-state MFCCs.


Figure 34: IBM Audio-Visual Data, Processed Features. Average I(X;Z|Q) for the audio-only LDA-MLLT features, in the same three projections.

Figure 35: Discriminative, IBM Audio-Visual Data, Processed Features. Average I(X;Z|Q) − I(X;Z) for the audio-only LDA-MLLT features.

While this could be an issue with the data, it could also indicate that in general these features have less overall information available. The EAR plots are also informative in their range. It appears that most of the EAR values in these plots are negative, indicating that these features have been pre-processed to the point that they are likely to produce little if any gain when adding cross-observation edges to the model. Unfortunately, we did not generate the cross audio-video MI or EAR plots, which could show potentially useful and non-redundant cross-stream information.

8.3 MI/EAR: Aurora 2.0 AM/FM Features

Figure 36: Aurora Word State, Phonetact Features. Average I(X;Z|Q) for the AM/FM features, in the same three projections.

We also applied novel AM/FM features provided by Phonetact Inc. to this analysis. We applied these features to the Aurora whole-word state case. We computed AM (amplitude modulation) and FM (frequency modulation) features.⁴


Figure 37: Discriminative Aurora Word State, Phonetact Features. Average I(X;Z|Q) − I(X;Z) for the AM/FM features.

These are computed by dividing the spectrum into 20 equally spaced bands using multiple complex quadrature band-pass filters. For each neighboring pair of filters, the higher-band filter output is multiplied by the conjugate of the lower-band output. The result is low-pass filtered and sampled every 10 ms. The FM features are the sine of the angle of the sampled output, and the AM features are the log of the real component. Although we expect that these features could be improved by further processing (e.g., cosine transform, mean subtraction, derivative concatenation), we used the raw features to provide the maximum contrast with MFCCs.

The plots are shown in Figure 36 and Figure 37. In these plots, the FM features fill the lower half of the plot, while the AM features fill the upper half. The difference between discriminative and non-discriminative jet plots is perhaps most visible in this case. In the left two non-discriminative MI plots, the most MI is found in the AM Phonetact features (the top half of the features) at time lags of up to −50 ms; the FM features have relatively less MI. In the left two EAR plots, however, the AM features offer significant MI only at time lags of more than about 50 ms into the past, and the FM features go from offering little help in the MI case to potentially providing useful discriminative information in this case.

The right-most MI plot shows a great deal of energy between parent and child AM features, especially when parents and children share the same feature number; the rest of the plot shows less. There appears to be little cross-information between the AM and FM features. In the third, discriminative plot, however, the entire plot shows little energy, and the regions have become fairly homogeneous. As will be seen in the experimental section, we use these features as conditional-only features (i.e., they are used as additional features W, as described in Section 7).

8.4 MI/EAR: SPINE 1.0 Neural Network Features

We also computed MI and EAR quantities on the SPINE 1.0 corpus using neural-network-based features [97] (these are features where a three-layer multi-layer perceptron is trained as a phone classifier on a 9-frame window of speech features, and then the output of the network, before the final non-linearity, is used as the features). The two plots are shown in Figure 38 and Figure 39.

In the MI case (Figure 38), we can see that most of the information about the features lies at 10 ms into the past (i.e., the previous frame). This is probably a result of the fact that the features are meant to predict phonetic category, and there is significant correlation between successive categorical phonetic classes (i.e., if a phone occurs at a frame, it is more likely than not to occur at the previous and next frames). Other than that, we can see that some of the features seem to have more temporal correlation than others, possibly a result of the phonetic category of the feature (e.g., we would expect vowels to extend over a wider time region). Unfortunately, the original phonetic labels of the features were not available to us at the workshop, so we could not verify this hypothesis. Lastly, in the right-most plot, we see that the features seem to be most correlated with themselves.

In the EAR plot case (Figure 39), the biggest difference seems to be that it is no longer the case that the previous frame (10 ms into the past) provides the most useful information, which is not surprising. The magnitudes of the EAR values for these plots also seemed encouraging. Unfortunately, time limitations prevented us from learning discriminative structure on these features.

⁴ We thank Y. Brandman of Phonetact, Inc. for providing this technology.


Figure 38: SPINE-1. Average I(X;Z|Q) for the SPINE-1 neural network features, in the same three projections.

Figure 39: Discriminative SPINE-1. Average I(X;Z|Q) − I(X;Z) for the SPINE-1 neural network features.

9 Visualization of Dependency Selection

For many of the experiments that were performed during the workshop, we used the EAR measure (see Section 7, and visualized in Section 8) to form structurally discriminative models (Section 4). Some of the structures that were formed were ultimately used in WER experiments (to be reported in Section 10), but because of time constraints some of the structures were not used. In this section, we present some of the structures that were formed and, in doing so, describe the format used to visualize these structures.

Some of these figures will be described further in later sections, when the word errors of the resulting models arediscussed.

In general, the edge augmentation heuristics in Section 7.6 meant that the goal was to select the most discriminativerelationships (edges) between past ”parent” variables and the present ”child” variables. A goal was to visualize whichrelationships were strongest, so we created so-called ”d-link” (dependency link) plots.

All of the d-link plots can be interpreted as follows: The horizontal axis indicates time, there the right-most positionis time t, and moving to the left increases inτ to positiont + τ . The axis is labeled with time-frames, each frameuses a 10ms step. Therefore, the axis is in units of 10ms time chunks. The vertical axes indicates feature position, themeaning of each depends on the features. Since most of the plots we show here are for MFCCs, the meanings (i.e.,the relative MFCC feature order) are the same as that described in Section 8. For each child feature in the current timeslice (timet, the right-most column of features) either the one or two most informative parents are shown. This meansthat either the top or the top two children are chosen according to the EAR measure, using the heuristic described inSection 7.6.3.

The parents are shown by a colored rectangle at locationt+τ and positionj in the plot. In general, a darker parentrectangle indicates that a parent has a strong relationship with more than one child (yellow=1 child, green=2, blue=3,and black=4). Also, the thickness of the line between parent and child indicates the magnitude of the EAR measurefor that parent-child pair. We now describe the main features of these plots.
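
As a concrete illustration, the following is a small sketch (assuming a precomputed EAR table, with hypothetical names) of the parent-selection step underlying these plots: for each child feature at time t, the top-k candidate (parent position, time lag) pairs are kept, ranked by the EAR measure.

    def select_parents(ear, num_parents=1):
        # ear[child] maps (parent_position, lag) pairs to EAR values; returns the
        # num_parents highest-scoring pairs for each child feature.
        structure = {}
        for child, scores in ear.items():
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            structure[child] = [pair for pair, _ in ranked[:num_parents]]
        return structure

    # Toy example: child feature 0 prefers its double-delta (feature 14) three frames back.
    example = {0: {(14, -3): 0.021, (0, -1): 0.004, (26, -9): 0.012}}
    print(select_parents(example, num_parents=2))  # {0: [(14, -3), (26, -9)]}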

Figure 40 shows the d-link plots for Aurora 2.0 whole-word state MFCCs, where only the top parent according to the EAR measure is displayed for each child. There are a number of interesting features of this plot.


Figure 40: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Only one parent per child is shown. 10ms time frames.

First, the small number of purely horizontal edges suggests that child variables infrequently asked for lagged, past versions of themselves as parents. This is in contrast to what would happen if pure MI were used, where it would often be the case that a child would use as a parent the corresponding feature at, say, one time frame in the past. The cepstral feature c0 and the log-energy feature were two of the exceptions.

Second, features c1 through c12 (features 0-11) very often took the second derivatives of c1 through c12 (features 14-25) as parents, and vice versa. There was not a similar relationship to the first derivatives of c1 through c12, nor was there a similar relationship between log-energy and either the first or second derivative of log-energy. This seems to indicate that second-derivative features from the past are discriminatively informative about features at time t. Note also that the delta features rarely if ever used the non-delta features as parents. As argued in Section 4, such edges would be among the least discriminative, and could potentially hurt performance.

Third, the time lags between parents and children were somewhat surprising. There were more parents at a lag of τ = 3 than at shorter time lags, and there were more parents around lags of τ = 9 to 12 than at lags between 3 and 9.

Finally, at long time lags (say over 100 ms), only the parent features c1 through c12 and log-energy were informative; derivative features were not. It appears therefore that overall long-term (100 ms or more) contours usefully contribute to the discriminability of the class. Perhaps interestingly, these are typical lengths of syllables rather than phones.

In general, we created both one- and two-parent d-link plots, but we found that the number of parents did not have a strong effect on the underlying word-error results. Figure 41 shows the d-link plot in the case of Aurora 2.0 whole-word state MFCCs, where the top two parents according to the EAR measure are displayed for each child. Perhaps the main feature of this plot, relative to Figure 40, is that c0 and log-energy children continue to desire parents at long time scales; in this case, the parent goes back 150 ms.

Figure 42 shows the d-link plots in the case where Q corresponds to a phoneme on Aurora 2.0, and again two parents per child are shown. For Aurora phone-state MFCCs, the d-link plots were largely similar to the whole-word state plots. One of the few noticeable differences is that the smattering of features c1-c12 (features 0-11) at time lags from -9 to -13, found with whole-word state MFCCs, was absent with phone-state MFCCs. Figure 43 shows the same, but where Q is the phone state (portions of a phoneme). Figure 44 shows the d-link plot when Q corresponds to entire Aurora 2.0 vocabulary words with one parent per child, and Figure 45 is the same with two parents per child. All of these plots show phenomena similar to those described above, along with a continued desire for cepstral c0 and log-energy to have long-range parents.


Figure 41: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Two parents per child shown. 10ms time frames.

Figure 46 shows the d-link plots for the combined feature set of the Aurora 2.0 MFCCs (features 0-41) followed by the Phonetact AM and then FM features (range 42-90), in that order. The MFCC features have colored rectangles at time t, indicating that they have child variables. The Phonetact features were used as purely conditional random variables (i.e., they were only added as members of W, as described in Section 7). It is interesting to note that the Phonetact features appear to be more discriminative when used in this context than the MFCC features themselves were. Features 60 and 61 were particularly informative, with parents at a range of time lags. More details are provided in the next section.

10 Corpora description and word error rate (WER) results

10.1 Experimental Results on Aurora 2.0

The experimental results described in this section focus on the Aurora 2.0 continuous digit recognition task [56]. The Aurora database consists of TIDigits data, which has been additionally passed through telephone channel filters and subjected to a variety of additive noises. There are eight different noise types, ranging from restaurant to train-station noise, and SNRs from -5dB to 20dB. For training, we used the "multi-condition" set of 8440 utterances that reflect the variety of noise conditions. We present aggregate results for test sets A, B, and C, which total about 70,000 test sentences [56].

We processed the Aurora data in two significantly different ways. In the first, we used the standard front-end provided with the database to produce MFCCs, including log-energy and C0. We then appended delta and double-delta features and performed cepstral mean subtraction, to form a 42-dimensional feature vector. In the second approach, we computed AM (Amplitude Modulation) and FM (Frequency Modulation) features as described in Section 8.3.
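
For concreteness, the first of these two pipelines can be sketched as follows (a minimal illustration, assuming a standard ±2-frame regression window for the deltas and applying the mean subtraction to the full vector; the exact Aurora front-end settings may differ).

    import numpy as np

    def deltas(feats, window=2):
        # Regression-based delta features; feats has shape (num_frames, dim).
        padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
        num = np.zeros_like(feats)
        for d in range(1, window + 1):
            num += d * (padded[window + d:window + d + len(feats)]
                        - padded[window - d:window - d + len(feats)])
        return num / (2 * sum(d * d for d in range(1, window + 1)))

    def make_42dim_features(static_feats):
        # static_feats: (num_frames, 14) MFCCs incl. C0 and log-energy -> (num_frames, 42).
        d1 = deltas(static_feats)
        d2 = deltas(d1)
        full = np.hstack([static_feats, d1, d2])
        return full - full.mean(axis=0)  # per-utterance mean subtraction (simplified)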

10.1.1 Baseline GMTK vs. HTK result

To validate our structure-learning methods, we built baseline systems (with GMTK emulating an HMM), and then enhanced them with the discriminative structures shown above. The first set of experiments therefore consisted only of baseline numbers. In particular, since GMTK was entirely new software, we wanted to ensure that the baselines we obtained with GMTK matched those of the standard HTK Aurora 2.0 baseline provided with the Aurora distribution.


Figure 42: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phones. Two parents per child shown. 10ms time frames.

        clean   20      15      10      5       0       -5
GMTK    99.2    98.5    97.8    96.0    89.2    66.4    21.5
PH      99.1    98.3    97.2    94.9    86.4    54.9    2.80
HP      98.5    97.3    96.2    93.6    85.0    57.6    24.0

Table 1: Word recognition rates for our baseline GMTK system as a function of SNR (dB). PH are the GMTK phone models. HP is reproduced from [56].

This is shown in Figure 48. The figure shows word accuracies at several signal-to-noise ratios for several different baseline systems, including HTK with whole-word models as specified in [56] (each of the 11 vocabulary words uses 16 HMM states with no parameter tying between states, as in the Aurora 2.0 release). Additionally, silence and short-pause models were used, with three silence states and the middle state tied to short-pause. All models were strictly left-to-right, and used 4 Gaussians per state for a total of 715 Gaussians.

The GMTK baseline numbers simulated an HMM and either 1) used whole-word models, thereby mimicking the Aurora 2.0 HTK baseline, or 2) used tied mono-phone state models. We examine and compare these results in Figure 49 (numbers also shown in Table 1), which shows the improvements relative to the HTK baseline. These results show the specific absolute recognition rates for our GMTK baseline systems as a function of SNR, averaged across all test conditions. Also presented is the published baseline result [56] with a system that had somewhat fewer (546) Gaussians (the GMTK system used 4 Gaussians per state rather than 3 because in this case we set up the Gaussian splitting procedure to double the number of Gaussians after each split; in newer experiments, it was found that GMTK was slightly better than HTK with the exact same model structure [20], possibly because of GMTK's new Gaussian handling code). Overall, the results are comparable with each other. We also see that, in general, the word-state model seems to outperform the phone-state model (which is not unexpected, since in small vocabulary tasks giving each word its own entire model is often useful).

10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment

Another set of experiments that were run regarded simple noise clustering in the Aurora 2.0 database. Essentially, the structure in Figure 11 is augmented to include a single hidden noise variable per frame, as shown in Figure 50.


Figure 43: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phone HMM states. Two parents per child shown. 10ms time frames.

The noise variable is observed during training (where it indicates the noise type of the training data) and is hidden during testing, hopefully allowing the underlying noise-specific model to be used in the right context.
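
In terms of the per-frame observation score, leaving the noise variable hidden at test time simply means marginalizing over the noise conditions. The following is a simplified stand-alone illustration (not GMTK code; the function and table names are hypothetical stand-ins for the learned conditional probability tables and Gaussian-mixture scores).

    def observation_score(obs, state, p_noise, obs_likelihood):
        # p(obs | state) = sum over noise conditions n of p(n) * p(obs | state, n).
        return sum(p_noise[n] * obs_likelihood(obs, state, n) for n in p_noise)

    # Toy usage with two equally likely noise conditions and a dummy likelihood function:
    p_noise = {"car": 0.5, "babble": 0.5}
    score = observation_score(0.0, "ah_1", p_noise, lambda o, s, n: 1.0)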

Figure 51 shows the results as "improvements" relative to the GMTK phone-based baseline for different SNR levels. Unfortunately, we do not see any systematic improvement in the results, and if anything the results get worse as the SNR increases. There are several possible reasons for this. First, the number of Gaussians in the noise-clustered model had increased relative to the baseline, so each Gaussian had less training data. Second, it was still the case that all noise conditions were integrated against during decoding. Lastly, this structure was obtained by hand and not discriminatively; perhaps the data was indeed modeled better, but the distinguishing features of the words were not, as argued in Section 4.

10.1.3 Aurora 2.0 different features

Yet another set of GMTK baseline experiments regarded the relative performance of different feature sets on Aurora 2.0, and the results are shown in Figure 52. The figure compares a system that uses MFCCs with systems that use MFCCs augmented with other features (various combinations of raw AM and raw FM features). The AM/FM features were raw in that there was no normalization, smoothing (subtraction), discrete cosine transform, or any other post-processing of the sort that often improves results with Gaussian-mixture HMM-based ASR systems. Our goal was to keep the AM/FM features as un-processed as possible, in the hope that the discriminative structure learning would find an appropriate structure over those features, rather than having the features (via post-processing) conform to that which a standard HMM finds most useful.

10.1.4 Mutual Information Measures

The next step of our analysis was a computation of the discriminative mutual information (i.e., the EAR measure) between all possible pairs of conditioning variables, as described in Section 8. Although we could compute this for hidden variables as well as observations (see also Section 12.2 on the beginnings of a new mutual-information toolkit which would solve this problem), for expediency and simplicity we focused on conditioning between observation components alone. Thus, the structures we present later are essentially expanded views of conditioning relationships among the individual entries of the acoustic feature vectors.


Figure 44: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. One parent per child shown. 10ms time frames.

10.1.5 Induced Structures

Using the method of Section 7, we induced conditioning relationships using both MFCCs and AM-FM features. In Figure 44, we show the induced structure for an MFCC system based on whole-word models, using Q-values corresponding to words in the EAR measure. As expected, there is conditioning between C0 and its value more than 100 ms previously.

In a second set of experiments, we used the AM-FM features as possible conditioning parents for the MFCCs; the induced conditioning relationships are shown in Figure 46. The first 42 features are the MFCCs; these are followed by the AM features, and finally the FM features. This graph indicates that FM features provide significant discriminative information about the MFCCs.

10.1.6 Improved Word Error Rate Results

Table 2 presents the relative improvement in word error rate for several structure-induced systems. There are several things to note. The first is that significant improvements were obtained in all cases. The second is that structure induction successfully identified the synergistic information present in the AM-FM features, and resulted in a significant improvement over raw MFCCs. The final point is that when we increased the size of a conventional system to the same number of parameters, performance was much worse in high noise conditions than in the improved models. Structure induction therefore appears to improve performance in a robust way. These results are further summarized in Figure 53.

10.2 Experimental Results on IBM Audio-Visual (AV) Database

The IBM Audio-Visual database was collected at the IBM Thomas J. Watson Research Center before the CLSP summer workshop in 2000. The database consists of full-face frontal video and audio of 290 subjects uttering ViaVoice training scripts, i.e., continuous read speech with mostly verbalized punctuation, and a vocabulary size of approximately 10,500 words. Transcriptions of all 24,325 database utterances, as well as a pronunciation dictionary, are provided. Details about this database can be found in [37].


Figure 45: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. Two parents per child shown. 10ms time frames.

        clean   20      15      10      5       0       -5
WWS     16.3    19.3    14.2    10.5    9.85    19.0    12.6
AMFM    10.4    9.73    6.91    4.29    7.05    17.4    15.5
WW      7.16    7.02    5.51    5.93    5.05    16.0    15.0
EP      18.9    6.56    14.7    10.7    7.16    5.09    1.20

Table 2: Percent word-error-rate improvement for structure-induced systems. WWS is a system where Q ranges over states; AMFM conditions MFCCs on AM-FM features; in WW, Q ranges over words; and EP is a straight Gaussian system with twice as many Gaussians as the baseline. For the WW and WWS systems, one parent per feature was used; in the AMFM case, two parents. EP has the same number of parameters as WW and WWS.

10.2.1 Experiment Framework

The audio-visual database has been partitioned into a number of disjoint sets in order to train and evaluate models for audio-visual ASR. The training set contains 35 hours of data from 239 subjects, and it is used to train all HMMs. A held-out data set of close to 5 hours of data from 25 subjects is used to train HMM parameters relevant to audio-visual decision fusion. A test set of 2.5 hours of data from 26 subjects is used for testing the trained models. There are also disjoint sets for speaker adaptation and multi-speaker HMM refinement experiments, but since we did not use those sets in our experiments due to time constraints, we refer readers to [37] for details.

Sixty-dimensional acoustic feature vectors are extracted for the audio data at a rate of 100 Hz. These features are obtained by a linear discriminant analysis (LDA) data projection, applied on a concatenation of nine consecutive feature frames, each consisting of a 24-dimensional discrete cosine transform (DCT) of mel-scale filter bank energies. LDA is followed by a maximum likelihood linear transform (MLLT) based data rotation. Cepstral mean subtraction (CMS) and energy normalization are applied to the DCT features at the utterance level, prior to the LDA/MLLT feature projection.
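
The splicing and projection steps just described can be sketched as follows (the LDA and MLLT matrices are placeholders; in the real system they are estimated from the training data, and CMS/energy normalization are applied beforehand).

    import numpy as np

    def splice_and_project(dct_frames, lda, mllt, context=4):
        # dct_frames: (num_frames, 24); lda: (60, 216); mllt: (60, 60) -> (num_frames, 60).
        padded = np.pad(dct_frames, ((context, context), (0, 0)), mode="edge")
        spliced = np.hstack([padded[i:i + len(dct_frames)]
                             for i in range(2 * context + 1)])   # 9 x 24 = 216 dims
        return spliced @ lda.T @ mllt.T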

In addition to the audio features, visual features are also extracted from the visual data. The visual features consist of a discrete cosine image transform of the subject's mouth region, followed by an LDA projection and MLLT feature rotation. They have been provided by the IBM participants for the entire database, are of dimension 41, and are synchronous with the audio features at a rate of 100 Hz.


Figure 46: Dlink plots for the Aurora 2.0 corpus, Phonetact features, where Q corresponds to whole-word states. Two parents per child shown. 10ms time frames.

10.2.2 Baseline

In the baseline audio ASR system, only the audio features are used for training. Context-dependent phoneme models are used as speech units, and they are modeled with HMMs with Gaussian mixture class-conditional observation probabilities. These are trained based on maximum likelihood estimation using embedded training by means of the EM algorithm.

The baseline system was developed using the HTK toolkit version 2.2. The training procedure is similar to the one described in the HTK reference manual. All HMMs had 3 states except the short pause model, which had only one state. A set of 41 phonemes was used. The first training step initializes the monophone models with single Gaussian densities. All means and variances are set to the global means and variances of the training data. Monophones are trained by embedded re-estimation using the first pronunciation variant in the pronunciation dictionary. A short pause model is subsequently added and tied to the center state of the silence model, followed by another 2 iterations of embedded re-estimation. Forced alignment is then performed to find the optimal pronunciation in case of multiple pronunciation variants in the dictionary. The resulting transcriptions are used for a further 2 iterations of embedded re-estimation, which ends the training of the monophone models. Context-dependent phone models are obtained by first cloning the monophone models into context-dependent phone models, followed by 2 training iterations using tri-phone based transcriptions. Decision-tree based clustering is then performed to cluster phonemes with similar context and to obtain a smaller set of context-dependent phonemes. This is followed by 2 training iterations. Finally, Gaussian mixture models are obtained by iteratively splitting the mixtures to 2, 4, 8, and 12 components, and by performing two training iterations after each split.

The resulting baseline audio-only system performance, obtained by rescoring lattices (i.e., trigram lattices provided by IBM with the log-likelihood value of the trigram language model on each lattice arc), is a WER of 14.44%.

10.2.3 IBM AV Experiments in WS’01

In our work during the 2001 summer workshop, we used the Graphical Models Toolkit (GMTK) ([9] and see Section 6). As described above, GMTK has the ability to rapidly and easily express a wide variety of models and to use them in as efficient a way as possible for a given model structure. Because of the rich features in the IBM AV database (both audio and video), we think GMTK is a perfect toolkit with which to study new model structures (rather than just an HMM) for this task.


Figure 47: Dlink plots for the SPINE-1 corpus, neural-network features, where Q corresponds to HMM states. One parent per child shown. 10ms time frames.

10.2.4 Experiment Framework

During the 2000 summer workshop [37], the use of an orthogonal source, namely visual features, was investigated under different conditions. However, that study still operated within the commonly used HMM framework, despite the fact that visual features are very different from audio features. In order to study alternative ways of using visual features, we decided to use the same training/heldout/test data split as the previous workshop (described in the previous section). The same features (60-dimensional audio features, 41-dimensional visual features) were also used in our experiments.5

Therefore, the HMM audio-only system described in the previous section also serves as our baseline.

The decoder in GMTK requires a decoding structure in order to perform Viterbi decoding. While the lattices from workshop 2000 could be converted to the required decoding structure, because of the 6-week time constraint we instead followed an n-best rescoring based decoding strategy. Namely, using the lattices generated from a first-pass decoding, we first generate n-best lists off-line. Subsequently, we rescore the n-best lists using various decoding structures with GMTK. We generated the top 100 best hypotheses for each heldout and test utterance from the corresponding lattice.
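
The rescoring step itself is conceptually simple; a minimal sketch is given below (the acoustic_score function is a hypothetical wrapper around a GMTK likelihood computation, the language model scores come from the lattices, and the weight is a tunable placeholder).

    def rescore_nbest(nbest, acoustic_score, lm_weight=12.0):
        # nbest: list of (hypothesis, lm_score) pairs for one utterance.
        # Returns the hypothesis with the best combined acoustic + weighted LM score.
        def total(item):
            hyp, lm_score = item
            return acoustic_score(hyp) + lm_weight * lm_score
        return max(nbest, key=total)[0]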

10.2.5 Matching Baseline

In order to test GMTK, we first tried to match the HMM baseline. We built a graphical-model based system simulating an HMM as our training structure. All parameters, including the number of states, the number of Gaussian densities per state, the mixture weights, the means and variances, and the transition probabilities, were taken from the HTK baseline model. This model had 10,738 states and 12 mixtures per state. This model was then used to perform n-best rescoring on the test data. Because the short pause model in the HTK model has only one state, which is tied with the middle state of the silence model (the transition probabilities are not tied), we used state sequences as input for the n-best rescoring. The transition probabilities of the short pause state were replaced by the transition probabilities of the middle state of the silence model. The WER (14.5%) showed that we can simulate an HMM with GMTK and achieve similar results.

10.2.6 GMTK Simulating an HMM

Along with the audio and visual features, we also have a pronunciation lexicon and a word to HMM state sequence mapping, both from IBM. The HMM states are tied states from a decision tree clustering procedure.

5 We note, however, that we ran out of time at the end of the workshop before we could integrate the visual features.


Figure 48: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This plot compares the word accuracies of HTK (546 Gaussians, shown in green) with those of GMTK at various signal-to-noise ratios (SNRs). GMTK results are provided both in the case of whole-word models (715 Gaussians), shown in blue (the same models that were used for HTK), and for tied-phone models (710 Gaussians), shown in red.

This decision tree is built for context-dependent phoneme models, and 4 neighboring phones are used as context. There are in total 2,808 different states for the 13,302 words in the vocabulary.

We trained HMM models based on the word-to-state-sequence mapping and the training data. At first, 5 iterations of EM training were carried out assuming a single Gaussian density for each state. Then each Gaussian was split into two, followed by another 5 iterations of EM training. Splitting continued until we had 16 Gaussian densities in each state. Finally, Gaussian densities with mixture weights close to zero were merged with the closest densities during another 5 iterations of EM training.
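
The splitting step can be sketched as follows for a single diagonal-covariance mixture (a common mean-perturbation heuristic, shown only for illustration; GMTK's exact splitting rule may differ).

    import numpy as np

    def split_mixture(weights, means, variances, epsilon=0.2):
        # Double the number of components: each component is duplicated with its
        # mean perturbed by +/- epsilon standard deviations and its weight halved.
        std = np.sqrt(variances)
        new_weights = np.concatenate([weights, weights]) / 2.0
        new_means = np.vstack([means + epsilon * std, means - epsilon * std])
        new_vars = np.vstack([variances, variances])
        return new_weights, new_means, new_vars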

The resulting HMM models had 2,808 states, and each state had about 16 Gaussian densities. The total number of Gaussian densities was 45,016, which is much smaller than the HTK baseline (128,856 Gaussian densities). On the test data, this system gave a WER of 17.9%. We believe the difference between this result and the baseline was due to the smaller number of Gaussian densities. However, due to the time constraints during the workshop, we did not carry out experiments to match the baseline.

10.2.7 EAR Measure of Audio Features

The HMM models with 2,808 states and 16 Gaussian mixtures per state were used to calculate the discriminative mutual information, or the EAR (explaining-away residual) measure [6], between all possible pairs of conditioning variables. For expediency and simplicity, only dependencies between observation components (feature components) were computed in our experiments. As described in Section 8, there are three dimensions in visualizing the scalar version of the EAR measure.

From Figure 35, we can see that there is hardly any information between any two of the audio feature components we studied. Recalling how the audio features were extracted from the data, a possible reason arises: LDA, as a linear discriminative projection, might be removing the discriminative information between feature components.


Figure 49: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This is essentially the same comparison as Figure 48 (GMTK word-state with 715 Gaussians, GMTK phone-state with 710 Gaussians, and HTK with 546 Gaussians), but shows relative WER improvements rather than absolute word accuracy.

As a result, if we look at the audio features alone, it is almost impossible to find any useful correlation between two feature components, even though the MI calculations should include a non-linear component of the discriminative information. It appears, however, that in this case the linear discriminant transformation was sufficient to remove it. This hypothesis, of course, should be more thoroughly verified, i.e., whether an LDA transformation on multiple windows (9 in this case) of feature vectors removes the potential for discriminative cross-observation linear dependencies.

The visual features, on the other hand, were extracted and processed independently of the audio features. There should be more information between a visual feature component and an audio feature component, and a plot of the EAR measure including the visual features is therefore very desirable. Unfortunately, we could not compute such an EAR measure before the end of the workshop, due to the high computational needs of our Aurora experiments. However, because of the flexibility of GMTK, we believe that once the computation is done, we could quickly induce interesting structures and experiment with the new discriminative models.

10.3 Experimental Results on SPINE-1: Hidden Noise Variables

We ran several additional experiments on the SPINE-1 speech-in-noisy-environments database using a noise-clustering method as described in Figure 50. Again, the noise variable was observed during training and hidden during recognition, and the various types of SPINE noises were clustered together at different degrees of granularity. We generated n-best lists with a separate HTK-based ASR system, and rescored them using the trained GMTK-based models. The results are shown in Figure 54. As can be seen, the WER gets worse as the number of noise clusters increases. Given this result, and the noise-variable results above for Aurora 2.0, it appears that simply adding a noise variable to a graphical model without concentrating on discriminability is not guaranteed to help (and is likely to hurt) performance. Due to time constraints, in the 6 weeks of the workshop we did not attempt any experiments that tried to optimize discriminative structure on SPINE-1, as would be suggested by Figure 47.


Figure 50: A simple noise model with a single noise clustering hidden variable at each time step. (Variables per frame include the noise condition, phone, position, transition, and observation.)

11 Articulatory Modeling with GMTK

Another goal of the graphical modeling team was to apply GMTK to articulatory modeling, or the modeling of the motion of the speech articulators, either in addition to or instead of phones. The ultimate goal was to use the articulatory structure as a starting point from which to produce a more discriminative structure. Due to time limitations, however, during the workshop we focused only on the base articulatory structure.

We use the term articulatory modeling to mean either the explicit modeling of particular articulators (tongue tip, lips, jaw, etc.) or the indirect modeling of articulators through variables such as manner or place of articulation. The motivations for using such a model come from both linguistic considerations and experiments in speech technology.

On the linguistics side, current theories of phonology, referred to as autosegmental phonology [47], hold that speech is produced not from a single stream of phones but from multiple streams, or tiers, of linguistic features. These features (not to be confused with the same term in pattern recognition, or with acoustic features in an ASR system) can evolve asynchronously and do not necessarily line up to form phonetic segments. Autosegmental phonology does not provide a specific set of tiers, but various authors have posited that the set includes features of tone, duration, and articulation.

On the speech technology side, there is mounting evidence that a phone-based model of speech is inadequate for recognition, especially for spontaneous, conversational speech [83]. Researchers have noted, for example, the difficulty of phonetically transcribing conversational speech, and especially of locating the boundaries between phones [39]. Furthermore, while it has been observed that pronunciation variability accounts for a large part of the performance degradation on conversational speech [50, 74, 75], efforts to model this variability with phone-based rules or additional pronunciations have had very limited success [74, 91]. One possible reason for this is that phonemes affected by pronunciation rules often assume a surface form intermediate between the underlying phonemes and the surface phones predicted by the rules [93]. Such intermediate forms may be better represented as changes in one or more of the articulatory features.

There have been several previous efforts to use articulatory models for speech recognition. One such effort has been by Deng and his colleagues (e.g., [31, 32]). In their experiments, Deng et al. use HMMs in which each state corresponds to a different vector of articulatory feature values. Their experiments explore different ways of constructing this state space using linguistic and physical constraints on articulator evolution, as well as different ways of modeling the HMM observation distributions. A similar model was used by Richardson et al. [88, 89]. Kirchhoff [65] used neural networks to extract articulatory features and then mapped these values to words, and in at least one case [64] allowed the articulators to remain unsynchronized except at syllable boundaries. There have been other attempts to use features at the lower levels of recognition, typically by first extracting articulatory feature values using neural networks or statistical models, and then using these values instead of or in addition to the acoustic observations for phonetic classification [35, 80].

One difficulty in using articulatory models for speech recognition is that the most commonly used computational structures (hidden Markov models, finite-state transducers) allow for only one hidden state variable at a time, whereas articulatory models involve several variables, one for each articulatory feature.


Figure 51: Baseline phone-based GMTK numbers compared to a simple model that uses noise clustering (i.e., includes a hidden variable for different types of noise, as seen in Figure 50). The plot shows relative WER improvement at each SNR for the noise-clustered phone model versus the baseline phone model.

While it is possible to implement articulatory models with single-state architectures by encoding every combination of values as a separate state (as in [31, 88, 89]), the mapping from the inherent multi-variable structure to the single-variable encoding can be cumbersome and the resulting model difficult to interpret and manipulate.

Graphical models (GMs) are therefore a natural tool for investigating articulatory modeling. Since they allow for an arbitrary number of variables and arbitrary dependencies between them, the specification of articulatory models in terms of GMs is fairly direct. The resulting models are easy to interpret and modify, allowing for much more rapid prototyping of different model variants.

The next section gives some background on articulatory modeling and our articulatory feature set. We then describe the progress made during the workshop, including the construction of a simple articulatory graphical model.

11.1 Articulatory Models of Speech

Figure 55 shows a diagram of the vocal tract with the major articulators labeled. The ones we are most concerned with are the glottis, velum, tongue, and lips. The glottis is the opening between the vocal folds (or vocal cords), which may vibrate to create a voiced (or pitched) sound or remain spread apart to create a voiceless sound. The position of the velum determines how much of the airflow goes to the oral cavity and how much to the nasal cavity; if it is lowered so as to allow airflow into the nasal cavity, a nasal sound (such as [m] or [n]) is produced. The positions of the tongue and lips affect the shape of the oral cavity and therefore the spectral envelope of the output sound. Articulatory features keep track of the positions of the articulators over time.

11.2 The Articulatory Feature Set

We can imagine a large variety of feature sets that could be used to represent the evolution of the vocal tract during speech. For example, in [22], Chomsky and Halle advocate a system of binary features. Some of these features are more physically based, such as voiced, and some are more abstract, such as tense. Other speech scientists, such as Goldstein and Browman [15], advocate more physically based, continuous-valued features such as lip constriction location/degree and velum constriction degree.


Figure 52: Absolute results on Aurora 2.0 at different SNRs using GMTK with different feature sets consisting of either just MFCCs (the blue plot) or feature vectors consisting of MFCCs augmented with other features (so larger feature vectors). MFCC = just MFCCs, MFCC_F = MFCCs + FM features, MFCC_A-F = MFCCs + AM/FM features, RAW_FM = just raw un-preprocessed FM features (i.e., no deltas, double deltas, cosine transform, smoothing, mean normalization, etc.), RAW_AM = just raw un-preprocessed AM features, RAW_AM_FM = raw AM and FM features together. The figure legend also includes a RAW_PT feature set.

Unfortunately we are not aware of any well-established, complete feature sets in the speech science literature that are well-suited to our task. As has been done in previous work on articulatory-based recognition [65, 31], we have drawn on existing feature sets to construct one that is appropriate for our purposes. The set we have used to date is shown in Table 3. It was based on considerations such as state space size and coverage of the phone set used in our Aurora experiments. It does have some drawbacks, however, such as the lack of a representation of the relationship between certain vowel features and certain consonant features; for example, in order to model the fact that an alveolar consonant can cause the fronting of adjacent vowels, we would need to include a dependency between place and tongueBodyLowHigh.

11.3 Representing Speech with Articulatory Features

The distinctions between describing speech in terms of phones and describing it in terms of articulatory feature streams can be seen through some examples.

First consider the case of the word warmth. The two ways of representing the canonical pronunciation of this word, as a string of phones and as parallel strings of features, are shown in Figure 56. In the articulatory representation, if a speaker goes through the parallel streams synchronously, he will produce the same phone string as in the phone-based representation. If, however, the features are not perfectly synchronized, this may produce some other phones, or some sounds that are not in the standard phone inventory of English at all. The same can occur if the features remain synchronized but do not reach their target values at some points.

To see the possible advantage of the feature-based approach, consider the part of the word where the speaker is transitioning from the [m] to the [th].


Figure 53: Summary of the relative improvements on the Aurora 2.0 corpus. WWS_DQ_1MAX = Aurora 2.0 with whole-word states (i.e., Q = whole-word states) using the edges selected by taking the single best according to the EAR measure. WW_Phonetact_DQ_2MAX = results where the AM/FM features are conditional random variables, and where the top two edges according to the EAR measure are chosen. WW_DQ_1MAX = the whole-word models (Q in the EAR measure are the words) choosing the one best edge. WWS_DQ_VAN_1MAX = results similar to the WWS_DQ_1MAX case except that the GMTK vanishing ratio was set so that the total number of final parameters was the same as the HMM baseline. GMTK Word Baseline (2x) is the GMTK HMM baseline with twice the number of Gaussians (this model had 4/3 times the number of parameters of the augmented structure models). Finally, GMTK Word Baseline is the standard GMTK-based HMM baseline.

The articulators must perform the following tasks: the velum must rise from its position for the nasal [m] to the non-nasal [th]; the vocal folds must stop vibrating; and the lips must part and the tongue tip move into an interdental position for the [th]. If all of the articulators move synchronously, then the phones [m] and [th] are produced as expected. However, the articulators may reach their target positions for the [th] at different times. One common occurrence in the production of this word is that the velum may be raised and voicing turned off before the lips part. In that case, there is an intermediate period during which a [p]-like sound is produced, and the uttered word then sounds like warmpth. One way to describe this within a phone-based representation is to say that the phone [p] has been inserted. However, this does not express our knowledge of the underlying process that resulted in this sound sequence. Furthermore, a [p] produced in this way is likely to be different from an intentional [p], for example by having a shorter duration.

Another example is the phenomenon of vowel nasalization, which can occur when a nasal sound follows a vowel (as in hand). If the velum is prematurely lowered, then the end of the vowel takes on a nasal quality. A phone-based description of this requires that we posit the existence of a new phone, namely the nasalized vowel. If we wish to express the fact that only the latter part of the vowel is nasalized, we need to represent the vowel segment as a sequence of two phones, a non-nasalized vowel and a nasalized one. A similar example is the early devoicing of a phrase-final voiced consonant, which would again require description in terms of two phones, one for the voiced part and one for the voiceless part.


Figure 54: Results on SPINE 1.0 with various degrees of noise-type clustering (WER plotted against the number of clusters: 0, 3, and 6). As the number of clusters increases, the error gets worse.

Feature name          Allowed values                                Comments
voicing               off, on                                       "off" refers to voiceless sounds, "on" to voiced sounds
velum                 closed, open                                  "closed" refers to nasal sounds, "open" to non-nasal sounds
manner                closure, sonorant, fricative, burst           "closure" refers to a complete closure of the vocal tract, e.g. the beginning part of a stop; "burst" refers to the turbulent region at the end of a stop
place                 labial, labio-dental, dental, alveolar,       the location of the oral constriction for consonant sounds; "nil" is used for vowels
                      post-alveolar, velar, nil
retroflex             off, on                                       "on" refers to retroflexed sounds
tongueBodyLowHigh     low, mid-low, mid-high, high, nil             height of the tongue body for vowels; "nil" is used for consonants
tongueBodyBackFront   back, mid, front, nil                         horizontal location of the tongue body for vowels; "nil" is used for consonants
rounding              off, on                                       "on" refers to rounded sounds

Table 3: The feature set used in our experiments.

11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

Figure 57 shows the structure of an articulator-based graphical model developed during the workshop. Each of the articulatory variables can depend on its own value in the previous frame, as well as on the current phone state. The dependency on the previous frame is intended to model continuity and inertia constraints on feature values. For example, a feature is likely to retain the same value over multiple frames rather than jump around, and most features cannot go from one value to a very different one without going through the intermediate value(s) (e.g., tongue height cannot go from "low" to "high" without going through "mid"). We also constructed an alternate version of the model, in which the articulatory variables also depend on each other's values in the current frame; these dependencies are not shown in the figure for clarity of presentation. The switching dependency from the phone state to the observation is used only for special handling of silence: if the current frame is a silence frame (i.e., the phone state variable is in one of the silence states), the observation depends only on the phone; otherwise, the observation depends only on the articulatory variables. This was done in order to avoid assigning specific articulatory values to silence.

The observation variable's conditional probability is implemented via a Gaussian mixture for each allowed combination of values, plus additional Gaussian mixture models for the silence states.


Figure 55: A midsagittal section showing the major articulators of the vocal tract. Reproduced from http://www.ling.upenn.edu/courses/Spring2001/ling001/phonetics.html.

All of the other variables are discrete with discrete parents, so their conditional probabilities are given by (multidimensional) probability tables. The probability tables for the phone variable are constructed identically to the 3-state phone-based model we used in our phone-based Aurora experiments. The probability tables of the articulatory variables determine the extent to which they can stray from their canonical values for each phone state and the extent to which they depend on their past values. For example, we can make the articulatory variables depend deterministically on the phone state by constructing their probability tables such that zero probability is given to all values except the canonical value for the current phone state; the model then becomes equivalent to our phone-based recognizer. We in fact ran this experiment as a sanity check.

For the general case of nondeterministic articulatory values, we needed to construct reasonable initial settings for the probability tables of the articulatory variables, while avoiding entering each initial probability value by hand. The following procedure was therefore used. We first defined a table of allowed values for each articulator in each phone state, with a probability for each allowed value, as shown in Table 4. We then defined a table of transition probabilities for each articulator given its previous value, as shown in Table 5. The final probability table for each articulatory variable, representing the probability of each of its possible values given its last value and the current phone, was constructed by multiplying the appropriate entries in the two tables and normalizing to ensure that the probabilities sum to one.
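
A minimal sketch of this table construction is given below, using simplified dictionary stand-ins for the real GMTK probability-table files and the example numbers from Tables 4 and 5.

    def build_cpt(allowed, transition):
        # allowed[s][v]    = P(v | phone state s), from a table like Table 4.
        # transition[u][v] = P(v | previous value u), from a table like Table 5.
        # Returns cpt[(s, u)][v] = P(a_t = v | a_{t-1} = u, phone state s).
        cpt = {}
        for s, values in allowed.items():
            for u in transition:
                unnorm = {v: p * transition[u].get(v, 0.0) for v, p in values.items()}
                total = sum(unnorm.values())
                cpt[(s, u)] = {v: p / total for v, p in unnorm.items()} if total else unnorm
        return cpt

    # Example for the voicing feature in the first state of [ah]:
    allowed = {"ah0": {"on": 0.9, "off": 0.1}}
    transition = {"off": {"off": 0.8, "on": 0.2}, "on": {"off": 0.2, "on": 0.8}}
    print(build_cpt(allowed, transition)[("ah0", "off")])  # approx. {'on': 0.69, 'off': 0.31}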

Because of the large memory requirements of the model, we were unable to run experiments with this model for cases where the articulators are not constrained to their canonical values. Toward the end of the workshop, log-space inference became available in GMTK, making it possible to trade off memory for running time. We therefore leave experiments with this model to future investigation.

Our main achievements during the workshop were in creating the infrastructure to construct various versions of this articulatory model. This includes:


phones:               w  ao  r  m  th
voicing:              on, off
velum:                cl, op, cl
manner:               son, fric
place:                lab, nil, post-alv, lab, dent
tongueBodyLowHigh:    high, low, mid-low, nil
tongueBodyBackFront:  back, mid, nil
rounding:             on, off

Figure 56: Two ways of representing the pronunciation of the word warmth. The top line shows the phone-based representation; the remaining lines show the different streams in the feature-based representation, using the features and (abbreviated) feature values defined in Table 3.

Figure 57: An articulatory graphical model. The structure from the phone state and above (the word, word transition, position, phone state, and phone transition variables, shown for frames i-1, i, and i+1) is identical to the phone-based recognizer. The articulatory variables are denoted a_1, ..., a_N, and the observation variable is denoted o. The special dependencies for the last frame are not shown but are identical to those in the phone-based model.

• Scripts to construct structures of the type in Figure 57 for arbitrary definitions of the articulatory feature set, phone set, and mappings from phones to articulatory values, for structures with and without inter-articulator dependencies in the current frame.

• Scripts to generate initial conditional probability tables from phone-to-articulator mapping tables and articulatory value transition tables, as described above.

• Scripts to generate an initial Gaussian mixture for each combination of articulatory values. We used a simple heuristic: if the combination corresponds to the canonical production of some phone state, use an existing Gaussian mixture for that phone state (from a prior training run of the phone-based recognizer); otherwise, use a silence model.

• The tables necessary to construct several variants of the model, including one in which the articulators must always take on their canonical values; one in which the articulators must reach the canonical values in the middle state of each phone but can stray from those values in the first and third states; and one in which the articulators need never take on the canonical values but are constrained to fall within a given range of those values.


Phone state    voicing                  velum                        ...
ah0            on (0.9), off (0.1)      closed (0.2), open (0.8)     ...
n0             on (1.0)                 closed (1.0)                 ...
...            ...                      ...                          ...

Table 4: Part of the table mapping phone states to articulatory values. According to this table, the first state of [ah] has a probability of 0.9 of being voiced and a probability of 0.1 of being voiceless, and it has a probability of 0.2 of being nasalized; an [n] must be voiced and nasal; and so on.

Value in previous frame    Pr(value = 0)    Pr(value = 1)
0                          0.8              0.2
1                          0.2              0.8

Table 5: The transition probabilities for the voicing variable. This particular setting says that the voicing variable has a probability of 0.8 of remaining at the same value as in the previous frame and a probability of 0.2 of changing values (regardless of the value in the previous frame).

12 Other miscellaneous workshop accomplishments

12.1 GMTK Parallel Training Facilities

This section documents the scripts that we developed to run GMTK tasks on multiple machines in parallel. The parallel scripts are written in the bash shell language and use the pmake (parallel make) utility to run distributed jobs. The scripts work by reading in a user-created "header" file defining various parameters, creating the appropriate makefiles, and then running pmake (possibly multiple times if training multiple iterations). We developed parallel scripts to (1) run EM training for a Gaussian mixture-based model with a given training "schedule" (as described in Section 12.1.1 below), and (2) create Viterbi alignments for a given set of utterances. We also used pmake to decode multiple test sets in parallel; for this purpose we did not write a separate script, but rather created makefiles by hand.

12.1.1 Parallel Training: emtrain_parallel

In order to train GMTK models in parallel, we divide the training set into a number of chunks. During each EM iteration, the statistics of each chunk are first computed using the current model parameters. At the end of the iteration, the statistics from all of the chunks are collected in order to update the model parameters. Below we describe the training procedure in greater detail.

The parallel training script emtrain_parallel is invoked via

emtrain_parallel [header_filename]

where the header file is itself a bash script containing parameter definitions of the form

$PARAMETER_NAME=value

The main parameters defined in the header file are:

• The files in which the training data, initial model parameters, and training structure are found, the file to which the final learned parameters are to be saved, and a directory for temporary files.

• A template masterfile from which multiple masterfiles will be created, one for each chunk of utterances.

• A training schedule defined by two arrays of equal length N, one for the -mcvr parameter and one for the -mcsr parameter of gmtkEMtrain. These define the vanishing and splitting ratios, respectively, for the first N iterations of training. After the Nth iteration, additional iterations are done until a convergence threshold (defined below) is reached, using the last -mcsr and -mcvr in the arrays.

• The location of a script, which must be provided by the user, to create all the necessary utterance-dependent decision tree files for a given chunk of utterances.


• The iteration number I from which to start, which can be anywhere from 1 to the last iteration in the mcvr/mcsr arrays. If I = 1, the initial trainable parameters file will be used. If I > 1, then training will start from the Ith iteration using the learned parameters from the (I − 1)th iteration. The latter case is meant to be used to restart a training run that has been halted before completion for some reason.

• The maximum number M of iterations to run, and the log-likelihood ratio threshold for convergence. Training will proceed until convergence or for M iterations, whichever comes first.

• The number of chunks to break the training data into, and the maximum number of chunks to be run in parallel at one time.

These and all other parameters are described in greater detail in an example header file in Appendix 12.1.3.

The basic procedure that emtrain_parallel follows is:

0. Define N = number of chunks to break training data into, I = initial EM iteration, M = maximum number of EM iterations, r_t = log-likelihood ratio threshold for convergence, INITIAL_GMP = initial trainable parameters file.

1. Divide the training set into the N chunks, and create a separate masterfile and decision tree (DT) files for each chunk of utterances. This step is done in parallel using pmake, as there may be a large number of masterfiles and DT files to create.

2. For iter = I to length of mcsr/mcvr arrays, do:

(a) If iter = 1, set the current parameters file to INITIAL_GMP. Otherwise, set it to the learned parameters file from iteration iter − 1.

(b) Create a pmake makefile with N targets, each of which stores the statistics of one chunk of the training data (using gmtkEMtrain with -storeAccFile). Run pmake using this makefile, storing all output to a log file.

(c) Collect the statistics from all of the accumulator files for this iteration (using gmtkEMtrain with -loadAccFile) and update the model parameters, using the current -mcsr and -mcvr to split or vanish Gaussians as appropriate.

3. Repeat until convergence or until the Mth iteration, whichever comes first:

(a) Increment iter.

(b) Follow (a) – (c) from (2) above.

(c) Test for convergence: letting L_i be the log likelihood of the training data in iteration i, compute the current log-likelihood ratio r as

r = 100 · (L_iter − L_iter−1) / |L_iter−1|     (7)

Convergence has been reached if r < r_t.
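For concreteness, the convergence test of Equation (7) can be sketched as follows (a minimal Python illustration, not part of the bash scripts themselves; the function and variable names are ours):

def has_converged(log_likes, threshold_pct):
    # log_likes: per-iteration training-data log likelihoods, oldest first
    # threshold_pct: the LOG_LIKE_THRESH value, in percent (e.g. 0.2)
    if len(log_likes) < 2:
        return False
    prev, curr = log_likes[-2], log_likes[-1]
    r = 100.0 * (curr - prev) / abs(prev)   # Equation (7)
    return r < threshold_pct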

We also include here several notes that we have found helpful to keep in mind when running this script:

• The last -mcsr and -mcvr values in the training schedule should be such that no splitting or vanishing is allowed. This is so that, during the "convergence phase" of the training run, successive iterations of the model have the same number of Gaussians and the log likelihoods can be compared (since, in EM training, the log likelihood is guaranteed to increase with each iteration only if the number of Gaussians is kept constant).

• During an iteration of gmtkEMtrain, some utterances may be skipped if their probability is too small according to the current model (and with the current beam width). If different utterances are skipped in successive iterations, then the log likelihoods of those two iterations are again not strictly comparable since they are computed on different data. This can become a serious problem if a significant number of utterances is skipped. The script does not warn the user about skipped utterances, but gmtkEMtrain does output warnings, which can be found in the pmake log files.


12.1.2 Parallel Viterbi Alignment: Viterbi_align_parallel

To create Viterbi alignments with GMTK, gmtkViterbi is run using the training structure instead of the decoding structure, and the values of the desired variables in each frame are written out into files, one per utterance (using the -dumpNames option).

The parallel Viterbi alignment script Viterbi_align_parallel is invoked via

Viterbi_align_parallel [header_filename]

where the header file is again a listing of parameter definitions. The main parameters defined in the header file are:

• Filenames for the observations of utterances to be aligned, model parameters, and training structure, directories in which to put temporary files and output alignments, and a filestem for the alignment filenames.

• A template masterfile and a script to create chunk DT files, as in emtrain_parallel.

• The number of chunks to break the observations into, and the maximum number of chunks to be run at once inparallel.

• A file containing the names of the variables to be stored in the alignments.

The procedure for parallel Viterbi alignment is much simpler than for parallel training. The script simply divides the data into the specified number of chunks, creates lists of output alignment filenames for each chunk, creates a makefile that runs gmtkViterbi with the appropriate parameters for each chunk, and runs pmake.
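As an illustration of the chunking step used by both scripts, the following Python sketch (not part of the actual scripts; it assumes utterances are numbered 0 through NUM_TRN_SENTS − 1) shows one way the min-utt:max-utt ranges used to name the chunk DT files could be generated:

def chunk_ranges(num_sents, num_chunks):
    # Split utterance indices 0..num_sents-1 into num_chunks contiguous
    # ranges, returned as "min-utt:max-utt" strings.
    size, rem = divmod(num_sents, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < rem else 0) - 1
        ranges.append("%d:%d" % (start, end))
        start = end + 1
    return ranges

# e.g. chunk_ranges(8440, 20) yields "0:421", "422:843", ..., "8018:8439"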

A similar procedure could be used to do parallel Viterbi decoding of a test set by dividing the set into chunks and then collecting the chunk outputs, although we did not do this during the workshop.

12.1.3 Example emtrain_parallel header file

############################################################
## Example header file for use with emtrain_parallel
############################################################

############################################################
## files & directories
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Initial trainable parameters file
INITIAL_GMP=/your/parameters/dir/initial_params.gmp

## Output file in which to put _final_ learned parameters
EMOUT_FILE=/your/parameters/dir/learned_params.gmp

## Structure file for training
STRFILE=/your/parameters/dir/aurora_training.str

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, chunk
## masterfiles, and learned parameters from intermediate
## iterations)
MISC_DIR=/your/temporary/dir/


## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.
MASTER_FILE=/your/parameters/dir/masterFile.template.params

############################################################
## other params for gmtkEMtrain
############################################################

## The training schedule:
##
## Arrays of values for the -mcvr and -mcsr parameters, one
## per iteration up to the last iteration before
## log-likelihood-based training takes over. After the
## last iteration specified in the arrays, training will
## continue until convergence (or until iteration
## $MAX_EM_ITER) using the last -mcvr and -mcsr values in
## the arrays.
##
## The schedule being used here is:
## - Run 1 iteration with no splitting or vanishing
## - Run 2 iterations in which all Gaussians are split but
##   none are vanished
## - Run 1 iteration with no splitting or vanishing
## - Continue training until convergence

MCVR_ARRAY="1e20 1e20 1e20 1e20"

MCSR_ARRAY="1e10 1e-15 1e-15 1e10"

## Value for -meanCloneSTDfrac -- assumed to be the same
## for all iterations
MEANCLONEFRAC=0.25

## Value for -covarCloneSTDfrac -- assumed to be same for
## all iterations
VARCLONEFRAC=0.0

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""


############################################################
## other parameters for emtrain_parallel
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in <range>.dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE. (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)
NUM_TRN_SENTS=8440

## Number of EM iterations;
## -- if <= the number of elements in $MCVR_ARRAY and
##    $MCSR_ARRAY, then training will follow the schedule
##    in the arrays up to $MAX_EM_ITER
## -- if > number of elements, then training will follow the
##    schedule through the last element of the arrays, and
##    then will continue until iter $MAX_EM_ITER or until log
##    likelihood threshold reached, whichever comes first
MAX_EM_ITER=100 # this means the schedule above will be completed,
                # then training will continue for at most another
                # 100-4 iterations or until convergence

## EM iteration number to start from (between 1 and the last
## iteration in the MCVR and MCSR arrays). If > 1, then
## emtrain_parallel will look for the learned params file from
## the previous iteration, $MISC_DIR/learned_params[k].gmp,
## where k = $INIT_EM_ITER-1. An error will be generated if
## this file doesn't exist.
INIT_EM_ITER=1

## Log-likelihood (LL) difference ratio, in percent, at which
## convergence is assumed to have occurred
LOG_LIKE_THRESH=0.2 # i.e. train until the LL difference
                    # between the current iteration and the
                    # previous one is 0.2% or less.

## Binary parameter indicating whether or not to keep the
## accumulator files from each iteration. The default value
## is "true". If "false", then each iteration will
## overwrite the accumulator files from the last iteration.
KEEP_ACC="true"


## Number of chunks to divide training data into--should be a
## multiple of EMTRAIN_PARALLELISM for maximum time-efficiency
EMTRAIN_CHUNKS=20

## Maximum number of processes to run in parallel at any time
EMTRAIN_PARALLELISM=20

## Number of processes to run on the local machine
NUM_LOCAL=0

## Set of nodes on which to run, in pmake syntax
NODES="delta grmo OR alta grmo"

## Specify the binary for gmtkEMtrain
GMTKEMTRAIN=/export/ws01grmo/gmtk/linux/bin/gmtkEMtrain.WedAug01_19_2001

12.1.4 Example Viterbi_align_parallel header file

############################################################
## Example header file for use with Viterbi_align_parallel
############################################################

############################################################
## directories
############################################################

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, and chunk
## masterfiles)
MISC_DIR=/your/temporary/dir/

## Directory in which to put alignments
ALIGN_DIR=/your/alignments/dir

## Filestem for alignment files. Output alignment files
## will be of the form $ALIGN_FILESTEM.utt_[num].out
ALIGN_FILESTEM=align

############################################################
## training & parameter files
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.


MASTER_FILE=/your/parameters/dir/masterFile.template.params

## Structure file for training
STRFILE=$PARAMS_DIR/aurora_training.str

## Trainable parameters file
TRAINABLE_PARAMS_FILE=/your/parameters/dir/params.gmp

############################################################
## params for gmtkViterbi
############################################################

## Specify the binary to use for gmtkViterbi
GMTKVITERBI=/export/ws01grmo/gmtk/linux/bin/gmtkViterbi.ThuJul26_23_2001

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""

## Any other params you want to pass to gmtkViterbi
MISC_PARAMS=""

############################################################
## other params
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in <range>.dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## File listing names of variables to dump out into alignments
DUMP_NAMES_FILE=/your/work/dir/dump_names

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE. (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)


NUM_TRN_SENTS=8440

## Max number of processes to run at once
PARALLELISM=20

## Number of chunks to divide training data into--should be a
## multiple of PARALLELISM
CHUNKS=20

## Number of processes to run locally
NUM_LOCAL=0

## Set of nodes on which to run
NODES="delta grmo OR alta grmo"

12.2 The Mutual Information Toolkit

Our approach to structure learning is to start from a base model (the HMM) and decide how we can improve the structure (by addition or removal of edges) to make it more discriminative and better at classification. For that we need a way to evaluate the effect on the quality of the structure of adding (or removing) edges of the graph, and this is where the MI Toolkit comes into play. The toolkit provides a set of tools designed to calculate mutual information between nodes in the graphical model. This measure is used to decide where changes to the structure will have the most effect.

We present the rest of this toolkit overview in the context of speech recognition, even though the tools are general enough to be applied to any time-varying series. Specifically, we assume data is presented in the form of sentences, which are collections of frames, or vectors. One problem that this toolkit solves is that of processing very large amounts of data, which cannot fit into memory. At any given time, only one sentence has to be loaded and processed. Moreover, the tools are designed to be run in parallel.

Besides the data, an input to the MI tools is a specification of the relative positions of the features in the speech frames between which we want to compute the MI. At any given time/frame, such a specification defines two vectors X and Y. By going over the data, we collect instances of the vectors X and Y. Section 12.2.2 discusses how the joint probability distributions are estimated from those instances. Section 12.2.1 starts by introducing background about mutual information and entropy that is useful for the rest of the discussion.

12.2.1 Mutual Information and Entropy

Mutual information is the amount of information a given random variable X has about another random variable Y. Formally,

I(X;Y) = E[ log ( p(X,Y) / (p(X)p(Y)) ) ]

Mutual information is 0 when X and Y are independent (i.e., p(X,Y) = p(X)p(Y)) and is maximal when X and Y are completely dependent, i.e., there is a deterministic relationship between them. The value of I(X;Y) in that case is min{H(X), H(Y)}, where H(X) is the entropy of X and is defined as

H(X) = E[ log ( 1 / p(X) ) ]

It measures the amount of uncertainty associated with X.

I(X;Y) = E[ log ( p(X,Y) / (p(X)p(Y)) ) ]
       = E[ log ( p(X|Y) / p(X) ) ]
       = H(X) − H(X|Y)


By symmetry I(X;Y) is also equal to H(Y) − H(Y|X), hence we get the upper bound since entropy is positive. The definition of the mutual information applies for any random vectors X and Y, but we make the distinction between the bivariate mutual information, when X and Y are scalars, and multivariate mutual information, when X and Y are vectors.

12.2.2 Toolkit Description

The MI Toolkit consists of four programs:

1. Discrete-mi

2. Bivariate-mi

3. Multivariate-mi

4. Conditional-entropy

Discrete-mi calculates the MI when the vectors are discrete, i.e., each component can take a finite number of values. There are no restrictions on the size of the vectors other than memory size limitations (a hash table version of this tool has also been written to alleviate the memory problem when vectors have sparse values).
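As a rough illustration of this counting-based estimate (a small Python sketch, not the actual Discrete-mi implementation; it assumes each discrete vector has been packed into a hashable tuple):

import math
from collections import Counter

def discrete_mi(pairs):
    # pairs: list of (x, y) tuples of discrete values; relative frequencies
    # serve as the probability estimates.
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) = c/n, p(x) = px[x]/n, p(y) = py[y]/n
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi  # in nats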

Bivariate-mi calculates the MI between two continuous scalar elements. The restriction to scalars allows the use of several optimizations that considerably speed up MI calculation.

Multivariate-mi generalizes the Bivariate-mi tool to arbitrary sized vectors.

Conditional-entropy calculates the conditional (on frame labels) entropy of arbitrary sized vectors.

There are two main parts to calculating the mutual information (or entropy). First, the joint probability distribution must be estimated. Then, after obtaining the marginals, the MI is estimated. Calculating the mutual information when the joint and marginal probability distributions are available is done as shown in Section 12.2.1 (the definition of MI).

For Discrete-mi, obtaining the probability distribution is straightforward: the probability of each n-tuple is just the frequency at which the n-dimensional configuration appears.

For the remaining programs, computing the probability distribution is more involved and relies on an Expectation Maximization procedure. Following is a description of how the mutual information between two random vectors X and Y is calculated using EM.

12.2.3 EM for MI estimation

Training data is partitioned into "sentences," each of which contains, depending on the length of the utterance, several hundred frames, or vectors of observations.

Besides the data, an input to the MI procedure is a specification of the relative positions of the features in the speech frames between which we want to compute the MI.

1. While EM convergence is not reached (i.e. the increase in log likelihood is above a given threshold),

2. Read a new sentence in.

3. For each position/lag specification, populate arrays of vectors X and Y by collecting the specified features over the sentence.

4. Accumulate sufficient statistics.

5. Go to 2 until all sentences are read.

6. Update the parameters of the joint probability distribution p_XY.

7. Partition the parameters of the p_XY distribution to get p_X and p_Y.

8. Get samples from the above three distributions and estimate (1/N) Σ_{i=1..N} log [ p_XY(x_i, y_i) / ( p_X(x_i) p_Y(y_i) ) ].

The quantity that is estimated in the last step approximates the mutual information, by the law of large numbers. The larger N is, the better the approximation. The sampling from the distributions can either be done directly from the data or by generating new samples according to the learned distributions. Both methods yield similar results.
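The final sampling step can be sketched as follows (a minimal Python/NumPy illustration, not the toolkit code; it assumes the three densities have already been estimated and expose a logpdf() method, e.g. fitted Gaussian mixtures, and that the dimensionality of X is known):

import numpy as np

def sampled_mi(p_xy, p_x, p_y, samples, dim_x):
    # samples: (N, dx+dy) array whose first dim_x columns hold X and the
    # rest hold Y; each p_* is assumed to provide a logpdf() method.
    x, y = samples[:, :dim_x], samples[:, dim_x:]
    # Average of log p_XY(x,y) - log p_X(x) - log p_Y(y) over the samples.
    return np.mean(p_xy.logpdf(samples) - p_x.logpdf(x) - p_y.logpdf(y))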


12.2.4 Conditional entropy

The previous MI procedure assumes both random variables are continuous, but often we are interested in the mutual information between a continuous and a discrete variable, i.e., we want to calculate I(X;A) where X is continuous and A discrete. Such a calculation is needed, for example, when we want to augment our graphical model with conditioning features Y: observations that, unlike the normal observations (which we also call scoring observations), do not depend on the state Q, but that can potentially help discrimination. A simple criterion to select conditioning observations is, thus, to compute the unconditional mutual information between the conditioning features and the state. If that value is small, we deduce that Y⊥⊥Q. We also compute I(Y;Q|X) to verify that X depends on Y.

We can write the mutual information as a function of entropy, and we get

I(X;A) = H(X) − H(X|A)
       = E_p[ log ( 1 / p(X) ) ] − E_{p(x,a)}[ log ( 1 / p(X|A) ) ]
       = −E_p[ log p(X) ] + Σ_{a_i} p(a_i) E_{p(x|a_i)}[ log p(X|A = a_i) ]

Therefore, we can use a similar procedure to that described above to estimate the probability distributions p_X and p_{X|A}, and by sampling from the distributions and using the law of large numbers, estimate the two terms −E_p[ log p(X) ] and E_{p(x|a)}[ log p(X|A = a_i) ]. The probability distribution p_A can be computed by counting the frequencies of the values of A.
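A simplified version of this estimate can be sketched as follows (a Python/NumPy illustration using a single Gaussian per term rather than the mixtures the toolkit estimates with EM; x holds scalar continuous observations and a the discrete labels):

import numpy as np

def gaussian_logpdf(x, mean, var):
    # Log density of a one-dimensional Gaussian.
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def mi_continuous_discrete(x, a):
    # I(X;A) = H(X) - sum_a p(a) H(X|A=a), with entropies estimated as
    # minus the average log density over the data itself.
    x, a = np.asarray(x, dtype=float), np.asarray(a)
    h_x = -np.mean(gaussian_logpdf(x, x.mean(), x.var()))
    h_x_given_a = 0.0
    for val in np.unique(a):
        xs = x[a == val]
        h_x_given_a += (len(xs) / len(x)) * -np.mean(
            gaussian_logpdf(xs, xs.mean(), xs.var()))
    return h_x - h_x_given_a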

12.3 Graphical Model Representations of Language Model Mixtures

Another project that was undertaken during the workshop was the use of GMTK for some basic language modeling experiments, namely using graphs to represent mixtures of component language models of various orders, and using the sparse conditional probability and switching hidden variable features of GMTK to implement them.

In general, decoding in speech recognition can be decomposed into two separate probability calculations according to the noisy channel model, argmax_w P(A|w)P(w). Language models approximate the joint probability over a sequence of words P(w), which becomes ∏_t P(w_t | h = w_{1...t−1}) using the chain rule. For n-gram language models, the word history at each time point h is limited only to the previous n − 1 words. This repeating structure of the word history lends itself nicely to the language of dynamic graphical models. Our experiments demonstrate how modular and easily trainable graphical models can be used for simple to advanced language modeling.

12.3.1 Graphical Models for Language Model Mixtures


Figure 58: Simple graph for a trigram language model.

For our experiments, we chose to model the standard trigram language model

P_trigram(w_t|h) = P(w_t | w_{t−2}, w_{t−1}) = N(w_{t−2}, w_{t−1}, w_t) / N(w_{t−2}, w_{t−1})     (8)

where N(w_{t−2}, w_{t−1}, w_t) is the count of the number of times the gram w_{t−2}, w_{t−1}, w_t occurs in the training data. This model can be described using the directed graph shown in Figure 58. The graph shows the set of word variables W_t for each t, and shows how W_t depends on the two previous words in the history.

In general, not all possible three-word sequences are seen in a given set of training data, and this was of course the case in the IBM AV data that we worked with during the workshop. We therefore implemented a smoothed probability



Figure 59: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents.

distribution by linearly interpolating the trigram distribution with bigram and unigram distributions, a method well known as Jelinek-Mercer smoothing [59]:

P(w_i|h) = α_1 P_trigram + α_2 P_bigram + (1 − α_1 − α_2) P_unigram     (9)

where P_bigram and P_unigram are defined similarly to the trigram case, as a ratio of counts. Another way of seeing this equation is that there exists a hidden discrete tri-valued random variable, say named α, which is used to mix between the various different language models. The above equation can therefore be written as:

P(w_i|h) = P(α = 3) P_trigram + P(α = 2) P_bigram + P(α = 1) P_unigram     (10)

Viewed in this way, we see that the α variable is really a switching parent which, depending on its value, is used to determine the set of parents that are active. This can be seen as the graphical model shown in Figure 59. In that figure, the α_t variable at each time step is a switching parent (indicated by the dashed edge; see also Section 6.1.7, which describes the idea of switching parents and how it is implemented in GMTK), and is used only to determine if some of the other edges in the graph are active or not. The values for which different edges are active are shown by the call-out boxes, indicating the α_t values which are required to activate each edge.
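The interpolation itself (Equation 10), independent of its graphical-model implementation with switching parents, amounts to the following (a Python sketch with hypothetical count-based probability tables stored as dictionaries):

def interp_prob(w, h, p_tri, p_bi, p_uni, alpha):
    # h = (w_{t-2}, w_{t-1}); alpha = (P(a=3), P(a=2), P(a=1)), summing to 1.
    # p_tri, p_bi, p_uni are dictionaries of count-based estimates.
    a3, a2, a1 = alpha
    return (a3 * p_tri.get((h[0], h[1], w), 0.0)
            + a2 * p_bi.get((h[1], w), 0.0)
            + a1 * p_uni.get(w, 0.0))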


Figure 60: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents. Here, α is dependent on the history.

In the simplest case, the random variable α has fixed probability values over time, and these values are typically learned from some held-out set not used to produce the base count distributions (i.e., deleted interpolation [59]). It is possible, however, for these weights to be a function of, and vary according to, the word history h, leading to the equation:

P(w_i|h) = P(α = 3|h) P_trigram + P(α = 2|h) P_bigram + P(α = 1|h) P_unigram     (11)

for a hidden discrete tri-valued random variable α. This structure is shown in Figure 60.

Because h can have quite a large state space, P(α|h) itself could be a difficult quantity to estimate. Therefore, in order to reduce this data-sparsity problem and estimate the quantity more robustly, we can form equivalence classes of word histories h based on the frequency of their occurrence, which are called buckets. In other words, those h values which occurred within a given range of a certain number of times within the training data were grouped together into one bucket B(h), and the resulting probability became P(α|B(h)), which is necessarily a discrete distribution with lower overall cardinality.

As will be seen in the experiments below, we vary the number of buckets, thereby evaluating the trade-off between the model's robustness and its predictive accuracy. GMTK scaled quite easily to these changes, automatically learning the varying number of weights given the appropriate graphical model structures. Perplexity results with various bucket sizes are given in the following sections.
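As an illustration of the bucketing scheme (a Python sketch; the bucket boundaries are hypothetical, since the exact count ranges used are not listed here):

def bucket_of(history_count, boundaries):
    # Map the training-set count of a history h to a bucket index B(h).
    # boundaries is an increasing list of count thresholds; for example,
    # boundaries = [1, 5, 50] gives four buckets: unseen histories,
    # counts 1-4, counts 5-49, and counts >= 50.
    b = 0
    for t in boundaries:
        if history_count >= t:
            b += 1
    return b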


Another aspect of language that is sometimes desirable is the ability to represent the notion of an optional lexical silence token that might occur between words (this might be a pause, or some other non-lexical entity). We will call this entity lexical silence, denoted by sil. This is particularly important when language modeling is used along with acoustic models, as the lexical silence "word" might have quite different acoustic properties than any of the real words in the lexicon. Therefore, a goal is to allow lexical silence to occur between any two words. A problem, however, that arises when this is done is that the probability model for the current word now depends on the previous word w_{t−1}, which might be lexical silence. It could be more beneficial to condition only on the previous "true" words (not lexical silences) when the context does contain lexical silence. In language modeling, this is a form of what is called a skip language model, where some word in the context is skipped in certain situations.

It is possible to represent such a construct with a graph and with GMTK. We first develop the model in the bi-gram case for simplicity, and then provide the tri-gram case (which also includes perplexity results below).

The essential problem is that the random variable W_t in a graph is conditioned on the previous word W_{t−1}. When the previous word is sil, however, the information about the previous true word is lost (since it is not in the conditioning set in the model P(W_t|W_{t−1}, W_{t−2})). Therefore, there must be some mechanism (graphical in this case) to keep track of what the previous true word is and use it in the conditioning set. This can be done by having an explicit extra variable, which we call R_t, for the previous Real word in the history. R_t then becomes part of the conditioning set and is used to produce the bi-gram score rather than W_t, which might be sil. Also, R_t itself needs to be maintained over time when the current word is sil. When W_t is not sil, R_t should be updated to be whatever the previous word truly is.

We first describe this in equations, giving the distribution of W_t given both W_{t−1} and R_{t−1}, and then provide a graph.

P(w_t | w_{t−1}, r_{t−1}) = Σ_{r_t} P(w_t, r_t | w_{t−1}, r_{t−1})     (12)
                          = Σ_{r_t} P(r_t | w_t, w_{t−1}, r_{t−1}) P(w_t | w_{t−1}, r_{t−1})     (13)
                          = Σ_{r_t} P(r_t | w_t, r_{t−1}) P(w_t | r_{t−1})     (14)

The quantity P(r_t | w_t, r_{t−1}) is set as follows:

P(r_t | w_t, r_{t−1}) = δ_{r_t = r_{t−1}}   if w_t = sil
P(r_t | w_t, r_{t−1}) = δ_{r_t = w_t}       if w_t ≠ sil

where δ_{i=j} is the Kronecker delta (indicator) function, which is one only when i = j and is otherwise zero. This implementation of the distribution does the following: if the current word w_t is sil, then R_t is a copy of whatever R_{t−1} is, so the real word is retained from time-slice to time-slice. If, on the other hand, w_t is a real word, then R_t is a copy of that word. Therefore, w_t is both a normal and a switching parent of r_t. The implementation of P(W_t|R_{t−1}) is as follows:

P(w_t | r_{t−1}) = Σ_{s_t} P(w_t | s_t, r_{t−1}) P(s_t)

where S_t is a hidden binary variable at time t which indicates if the current word is lexical silence. The implementation of P(w_t | s_t, r_{t−1}) is as follows:

P(w_t | s_t, r_{t−1}) = δ_{w_t = sil}            if s_t = 1
P(w_t | s_t, r_{t−1}) = P_bigram(w_t | r_{t−1})  if s_t = 0

This means that whenever S_t is "on", it forces W_t to be sil, and that is the only token that gets any (and all of the) probability. P(s_t) is simply set to the probability of lexical silence (the relative frequency can be obtained from training data). The graph for this model is shown in Figure 61. In the graph, W_t has a dependency only on R_{t−1} and S_t. R_t uses W_t both as a switching and a normal parent. The value of W_t switches the parent of R_t to be either W_t, to obtain a new real word value, or R_{t−1}, to retain the previous real word value.

Given the above model, the string of lexical items "Fool me once sil, shame on sil, shame on you" will get probability P_bg(me|Fool) P_bg(once|me) P_bg(shame|once) P_bg(on|shame) P_bg(shame|on) P_bg(on|shame) P_bg(you|on) P(sil)^2, where the probability of lexical silence is applied twice.
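The deterministic behavior of R_t and the resulting scoring can be sketched as follows (a Python illustration of the equations above, not GMTK code; the sil token and the probability tables are placeholders, and, as in the example above, the (1 − P(sil)) factors are ignored):

import math

SIL = "sil"  # placeholder token for lexical silence

def next_real_word(w_t, r_prev):
    # Deterministic CPT P(r_t | w_t, r_{t-1}): keep the previous real word
    # when the current word is lexical silence, otherwise copy the word.
    return r_prev if w_t == SIL else w_t

def skip_bigram_logprob(words, p_bi, p_sil):
    # Score a word string under the skip-like bigram of Figure 61.
    # p_bi maps (previous real word, word) pairs to bigram probabilities.
    logp, r = 0.0, None
    for w in words:
        if w == SIL:
            logp += math.log(p_sil)
        elif r is not None:
            logp += math.log(p_bi[(r, w)])
        r = next_real_word(w, r)
    return logp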



Figure 61: An implementation of a skip-like bigram, where the previous real word R_t is used to look up the bigram probability whenever the previous word is lexical silence.


Figure 62: An implementation of a skip-like bigram and a mixture of bigram and unigram models, essentially a combination of a bigram version of Figure 59 and of Figure 61.

The model that skips lexical silence and the model that mixes between bi-gram and unigram probabilities may be combined into one model, as shown in Figure 62.

Note that it is also possible to use two binary auxiliary hidden variables (rather than just one α variable) to produce the mixture given in Figure 60 and described in Equation 11. The trigram decomposition can be performed as follows:

P(w_t | w_{t−1}, w_{t−2}) = P(α_t = 1 | w_{t−1}, w_{t−2}) P_tri(w_t | w_{t−1}, w_{t−2})     (15)
                          + P(α_t = 0 | w_{t−1}, w_{t−2}) P(w_t | w_{t−1})     (16)

where

P(w_t | w_{t−1}) = P(β_t = 1 | w_{t−1}) P_bi(w_t | w_{t−1})     (17)
                 + P(β_t = 0 | w_{t−1}) P(w_t)     (18)

and where we now have two hidden switching parent variables at each time slice, α_t (deciding between a bi-gram and a tri-gram) and β_t (deciding between a bi-gram and a unigram). This model is shown graphically in Figure 63.


Figure 63: A mixture of a trigram, bigram, and unigram language model using two binary hidden variables α (to control trigram vs. bigram) and β (to control bigram vs. unigram).

Moreover, it is possible to implement a trigram model that skips over contexts that contain sil, similar to the bigram case. This trigram model is given in Figure 64. Combining this model together with a mixture model, we at last arrive at the model that was used for the perplexity experiments that were carried out during the 2001 JHU workshop. This model is given in Figure 65. It can be used for language model training and language scoring. We can also add the remaining structure that is given in the lower portion of Figure 14 to obtain a general speech recognition decoder that uses this mixture and skip language model.



Figure 64: An implementation of a skip-like trigram, where the previous word R_t and the previous previous word V_t are both used to look up the trigram probability whenever the previous or previous previous word is lexical silence. The structure also includes the provisions necessary to update the true words within the history.


Figure 65: A model that combines the two-variable mixture between trigram, bigram, and unigram, and that also implements the skipping of lexical silence at the trigram level.

12.3.2 Perplexity Experiments

We tested the language model given in Figure 65 in a set of perplexity experiments. We used the IBM Audio-Visual corpus, comprised of ≈ 13,000 training utterances and a ≈ 13,000 word vocabulary. Test data was a subset of the training utterances, and an additional subset of held-out data was used to train the weights (i.e., the distributions of the variables α and β).

The trigram, bigram, and unigram probabilities were calculated offline by taking frequency counts over the training data. Once the number of buckets was specified in the structure files, the weights were learned with GMTK.

To test the language model, we calculated the probability assigned to the test data by the language model. This was then converted into perplexity, a common measure of how well the language model predicts language. As is well known, perplexity can be thought of as the average branching factor of the model, i.e., how many words it assigns equal probability to, and it is correlated with WER for speech recognition. While perplexity reduction is often not a good predictor of overall word error reduction, in many cases it can be quite useful, especially when the perplexity reductions are large.
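The conversion is the standard one (a short Python sketch, assuming natural-log per-word probabilities of the test data):

import math

def perplexity(word_logprobs):
    # PPL = exp( -(1/N) * sum_i log P(w_i | h_i) )
    return math.exp(-sum(word_logprobs) / len(word_logprobs))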

12.3.3 Perplexity Results

Experiments were run with linearly interpolated bigram and trigram language models, varying the number of buckets from 1 to 10. The results are summarized below in terms of language model perplexity over the test data:

              bigram    trigram
1 bucket       89.54      38.71
2 buckets      81.63      28.59
10 buckets     80.79      28.09

Adding one more word into the history through the trigram model significantly reduced perplexity, as expected given enough training data. The single bucket models performed reasonably well. The largest gains were from using


two buckets instead of one, which effectively lowered the weight of n-grams whose history was never seen. More buckets did not appear to help, as the learned weights converged to similar numbers for the nonzero history count buckets.

12.3.4 Conclusions

The language model experiments performed with GMTK yield results comparable to those of standard language modeling toolkits. However, the similarity between the training, testing, and decoding graphical model structures allows for an easier transition between the different phases.

Given the rich language of graphical models, extensions could be made to the trigram language model, such as higher-order n-grams or caching trigger words. These would require the addition of new variables and dependency arcs. Once the structures are specified, the modular, trainable framework of GMTK would allow for seamless training, testing, and decoding.

13 Future Work and Conclusions

There were many goals of the JHU 2001 workshop (see Section 1), only some of which were realized. In this section, we briefly outline some of what could be the next steps of the research that began at this workshop.

13.1 Articulatory Models

The work we have begun at the workshop has produced some simple articulatory models and much of the infrastructure needed to build similar models. However, there is a large space of articulatory models that can be explored with graphical models. There are many additional dependencies that are likely to be present in speech but are not represented in the structures that we have constructed, as well as entirely different structures that may better model the asynchronous nature of the articulators.

One of the main ideas that we had hoped to investigate, but were not able to during the workshop, is structure learning over the hidden articulatory variables. For a model with a large number of variables, such as an articulatory model, it is infeasible to include all of the possible dependencies and labor-intensive to predetermine them using linguistic considerations. This would therefore be a natural application for structure learning, and in particular for the ideas of discriminative structure learning using the EAR measure.

13.1.1 Additional Structures

There is also a wealth of other structures to be explored. The type of structure we have built (see Figure 57) is limited in several respects. While such a model can allow articulators to stray from their target values, it cannot truly represent asynchronous articulatory streams since all of the articulatory variables depend on the current phone state. In order to allow the articulators to evolve asynchronously, new structures are needed. One possibility we have considered, but not implemented during the workshop, is to treat each articulatory stream in the same way that the phone is treated in the phone-based model, with its own position and transition variables. In such a model, each articulatory stream could go through its prescribed sequence of values at its own pace. One example of such a structure, in which there is no phone variable at all, is shown in Figure 66.

It is important, however, to constrain the asynchrony between the articulatory variables, for both computational and linguistic reasons. This can be done by forcing them to synchronize (i.e., reach the same positions) at certain points, such as word or syllable boundaries, or by requiring that some minimal subset be synchronized at any point in time. As an example, the structure of Figure 66 includes the dependencies that could be used for synchronization at word boundaries. The degree of synchronization is an interesting issue that, to our knowledge, has not been previously investigated in articulatory modeling research and would be fairly straightforward to explore using graphical models.

13.1.2 Computational Issues

Computational considerations are also likely to be an important issue for future work. As mentioned in Section 11, we were unable to train or decode with these models during the time span of the workshop. With the addition of log-space



Figure 66: A phone-free articulatory model allowing for asynchrony. In this model, each feature has a variable a^i representing its current value, a variable a^i_pos representing its position within the current word, and a variable a^i_tr indicating whether the feature is transitioning to its next position.

inference in GMTK, training and decoding can now be done within reasonable memory constraints, but at the expense of increased running time. Therefore, work is still needed to enable articulatory models to run more efficiently.

One way to control the computational requirements of the model is by careful constraints on the state space. Constraints on the overall articulatory state space can be applied through the choice of articulatory variables and inter-articulator dependencies. The "instantaneous" state space can also be controlled by imposing various levels of sparsity on the articulatory probability tables.

In addition, measures can be taken to limit the size of the acoustic models (the conditional probability densities of the observation variable). For example, instead of having a separate model for each allowed combination of articulatory values, some of the models could be tied, or product-of-experts models [55] could be used to combine observation distributions corresponding to different articulators. We have begun to explore the product-of-experts approach in work we have pursued since the workshop. Finally, the distribution dimensionality could be reduced by choosing only a certain subset of the acoustic observations to depend on each articulator (which could be different for each articulator).

13.2 Structural Discriminability

One of the main goals of the workshop was to investigate the use of discriminative structure learning. In this work, we have described a methodology that can learn discriminative structure between collections of observation variables. One of the key goals of the workshop that time constraints prevented us from pursuing was the induction of discriminative structure at the hidden level. For example, given a baseline articulatory network, it would be desirable to augment that network (i.e., either add or remove edges) so as to improve its overall discriminative power. Work is planned in the future to pursue this goal.


13.3 GMTK

There are many plans over the next several years for additions and improvements to GMTK. Some of these include: 1) a new, faster inference algorithm that utilizes an off-line triangulation algorithm, 2) approximate inference schemes such as a variational approach and a loopy propagation procedure, 3) non-linear dependencies between observations, 4) better integration with language modeling systems, such as with the SRI language-modeling toolkit, 5) the use of hidden continuous variables, 6) adaptation techniques, and 7) general performance enhancements. Many other enhancements are planned as well. GMTK was conceived for use at the JHU 2001 workshop, but it is believed that it will become a useful tool for a variety of speech recognition, language modeling, and time-series processing tasks over the next several years.


14 The WS01 GM-ASR Team

Lastly, we would like to once again mention and acknowledge the WS01 JHU team, which consisted of the following people:

Jeff A. Bilmes — University of Washington, Seattle
Geoff Zweig — IBM
Thomas Richardson — University of Washington, Seattle
Karim Filali — University of Washington, Seattle
Karen Livescu — MIT
Peng Xu — Johns Hopkins University
Kirk Jackson — DOD
Yigal Brandman — Phonetact Inc.
Eric Sandness — Speechworks
Eva Holtz — Harvard University
Jerry Torres — Stanford University
Bill Byrne — Johns Hopkins University

The team is also shown in Figure 67 (along with several friends who happened by at the time of the photo shoot). Speaking as the team leader (J.B.), I would like to acknowledge and give many thanks to all the team members for doing an absolutely wonderful job. I would also like to thank Sanjeev Khudanpur, Bill Byrne, and Fred Jelinek and all the other members of CLSP for creating a fantastically fertile environment in which to pursue and be creative in performing novel research in speech and language processing. Lastly, we would like to thank the sponsoring organizations (DARPA, NSF, DOD) without which none of this research would have occurred.


Figure 67: The JHU WS01 GM Team. To see the contents of the T-shirts we are wearing, see Figure 68.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Inc., Reading, Mass., 1986.


[2] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46:325–343, March 2000.
[3] J.J. Atick. Could information theory provide an ecological theory of sensory processing? Network, 3, 1992.
[4] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49–52, Tokyo, Japan, December 1986.
[5] J. Baker. The Dragon system—an overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23:24–29, 1975.
[6] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.
[7] J. Bilmes. Graphical models and automatic speech recognition. Technical Report UWEETR-2001-005, University of Washington, Dept. of EE, 2001.
[8] J. Bilmes. The GMTK documentation, 2002. http://ssli.ee.washington.edu/~bilmes/gmtk.
[9] J. Bilmes and G. Zweig. The Graphical Models Toolkit: An open source software system for speech and time-series processing. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002.
[10] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999.
[11] J.A. Bilmes. Dynamic Bayesian Multinets. In Proceedings of the 16th Conf. on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2000.
[12] J.A. Bilmes. Factored sparse inverse covariance matrices. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000.
[13] J. Binder, K. Murphy, and S. Russell. Space-efficient inference in dynamic probabilistic networks. Int'l Joint Conf. on Artificial Intelligence, 1997.
[14] C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[15] C. P. Browman and L. Goldstein. Articulatory phonology: An overview. Phonetica, 49:155–180, 1992.
[16] The BUGS project. http://www.mrc-bsu.cam.ac.uk/bugs/Welcome.html.
[17] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195–210, 1994.
[18] K.P. Burnham and D.R. Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 1998.
[19] R. Chellappa and A. Jain, editors. Markov Random Fields: Theory and Application. Academic Press, 1993.
[20] C.-P. Chen, K. Kirchhoff, and J. Bilmes. Towards simple methods of noise-robustness. Technical Report UWEETR-2002-002, University of Washington, Dept. of EE, 2001.
[21] D.M. Chickering. Learning from Data: Artificial Intelligence and Statistics, chapter Learning Bayesian networks is NP-complete, pages 121–130. Springer-Verlag, 1996.
[22] N. Chomsky and M. Halle. The Sound Pattern of English. New York: Harper and Row, 1968.
[23] G. Cooper and E. Herskovits. Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393–405, 1990.
[24] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw Hill, 1990.


[25] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.
[26] D.R. Cox and D.V. Hinkley. Theoretical Statistics. Chapman and Hall/CRC, 1974.
[27] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153, 1993.
[28] Journal: Data Mining and Knowledge Discovery. Kluwer Academic Publishers. Maritime Institute of Technology, Maryland.
[29] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988.
[30] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39, 1977.
[31] L. Deng and K. Erler. Structural design of hidden Markov model speech recognizer using multivalued phonetic features: Comparison with segmental speech units. Journal of the Acoustical Society of America, 92(6):3058–3067, Dec 1992.
[32] L. Deng, G. Ramsay, and D. Sun. Production models as a structural basis for automatic speech recognition. Speech Communication, 33(2-3):93–111, Aug 1997.
[33] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, Inc., 1973.
[34] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000.
[35] E. Eide. Distinctive features for use in an automatic speech recognition system. In Eurospeech-99, 2001.
[36] K. Elenius and M. Blomberg. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 535–538. IEEE, 1982.
[37] C. Neti et al. Audio-visual speech recognition: WS 2000 final report, 2000. http://www.clsp.jhu.edu/ws2000/finalreports/avsr/ws00avsr.pdf.
[38] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, NY, 1980.
[39] E. Fosler-Lussier, S. Greenberg, and N. Morgan. Incorporating contextual phonetics into automatic speech recognition. In Proceedings 14th International Congress of Phonetic Sciences, San Francisco, CA, 1999.
[40] B. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.
[41] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–141, 1991.
[42] N. Friedman and M. Goldszmidt. Learning in Graphical Models, chapter Learning Bayesian Networks with Local Structure. Kluwer Academic Publishers, 1998.
[43] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[44] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd Ed. Academic Press, 1990.
[45] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–74, 1996.
[46] Z. Ghahramani. Lecture Notes in Artificial Intelligence, chapter Learning Dynamic Bayesian Networks. Springer-Verlag, 1998.
[47] J. A. Goldsmith. Autosegmental and Metrical Phonology. B. Blackwell, Cambridge, MA, 1990.
[48] R. A. Gopinath. Maximum likelihood modeling with Gaussian distributions for classification. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998.


[49] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti. Maximum entropy and MCE based HMM stream weightestimation for audio-visual asr. InProc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002.

[50] S. Greenberg, S. Chang, and J. Hollenback. An introduction to the diagnostic evaluation of the switchboard-corpus automatic speech recognition systems. InProc. NIST Speech Transcription Workshop, College Park,MD, 2000.

[51] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft,1995.

[52] D. Heckerman, Max Chickering, Chris Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks fordensity estimation, collaborative filtering, and data visualization. InProceedings of the 16th conf. on Uncer-tainty in Artificial Intelligence. Morgan Kaufmann, 2000.

[53] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledgeand statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994.

[54] J. Hertz, A. Krogh, and R.G. Palmer.Introduction to the Theory of Neural Computation. Allan M. Wylde, 1991.

[55] G. Hinton. Products of experts. InProc. Ninth Int. Conf. on Artificial Neural Networks, 1999.

[56] H. G. Hirsch and D. Pearce. The aurora experimental framework for the performance evaluations of speechrecognition systems under noisy conditions.ICSA ITRW ASR2000, September 2000.

[57] The ISIP public domain speech to text system. http://www.isip.msstate.edu/projects/speech/software/index.html.

[58] T.S. Jaakkola and M.I. Jordan.Learning in Graphical Models, chapter Improving the Mean Field Approxima-tions via the use of Mixture Distributions. Kluwer Academic Publishers, 1998.

[59] F. Jelinek.Statistical Methods for Speech Recognition. MIT Press, 1997.

[60] F.V. Jensen.An Introduction to Bayesian Networks. Springer, 1996.

[61] M.I. Jordan and C. M. Bishop, editors.An Introduction to Graphical Models. to be published, 2001.

[62] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul.Learning in Graphical Models, chapter An Intro-duction to Variational Methods for Graphical Models. Kluwer Academic Publishers, 1998.

[63] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition.IEEETrans. on Speech and Audio Signal Processing, 5(3):257–265, May 1997.

[64] K. Kirchhoff. Syllable-level desynchronisation of phonetic features for speech recognition. InProceedingsICSLP 1996, 1996.

[65] K. Kirchhoff. Robust Speech Recognition Using Articulatory Information. PhD thesis, University of Bielefeld,Germany, 1999.

[66] K. Kjaerulff. Triangulation of graphs - algorithms giving small total space. Technical Report R90-09, Depart-ment of Mathematics and Computer Science. Aalborg University., 1990.

[67] P. Krause. Learning probabilistic networks.Philips Research Labs Tech. Report., 1998.

[68] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm.IEEE Trans.Inform. Theory, 47(2):498–519, 2001.

[69] S.L. Lauritzen.Graphical Models. Oxford Science Publications, 1996.

[70] C.-H. Lee, E. Giachin, L.R. Rabiner, R. Pieraccini, and A.E. Rosenberg. Improved acoustic modeling forspeaker independent large vocabulary continuous speech recognition.Proc. IEEE Intl. Conf. on Acoustics,Speech, and Signal Processing, 1991.

[71] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986.


[72] J. Luettin, G. Potamianos, and C. Neti. Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2001.

[73] D.J.C. MacKay. Learning in Graphical Models, chapter Introduction to Monte Carlo Methods. Kluwer Academic Publishers, 1998.

[74] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Fabricating conversational speech data with acoustic models: A program to examine model-data mismatch. In ICSLP, 1998.

[75] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Studies with fabricated Switchboard data: Exploring sources of model-data mismatch. In Proc. DARPA Workshop on Conversational Speech Recognition, Lansdowne, VA, 1998.

[76] G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Statistics, 1992.

[77] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, 1997.

[78] C. Meek. Causal inference and causal explanation with background knowledge. In P. Besnard and S. Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), pages 403–410, San Francisco, CA, USA, August 1995. Morgan Kaufmann Publishers.

[79] M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.

[80] H. Meng. The use of distinctive features for automatic speech recognition. Master's thesis, Massachusetts Institute of Technology, 1991.

[81] K. Murphy. The Matlab Bayesian network toolbox. http://www.cs.berkeley.edu/~murphyk/Bayes/bnsoft.html.

[82] Y. Normandin. An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991.

[83] M. Ostendorf. Moving beyond the ‘beads-on-a-string’ model of speech. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Keystone, CO, 1999.

[84] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988.

[85] J. Pearl. Causality. Cambridge University Press, 2000.

[86] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78:1481–1497, September 1990.

[87] L.R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, 1993.

[88] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator Markov models for speech recognition. In Proc. of the ISCA ITRW ASR2000 Workshop, Paris, France, 2000. LIMSI-CNRS.

[89] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator Markov models: Performance improvements and robustness to noise. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000.

[90] T. S. Richardson. Learning in Graphical Models, chapter Chain Graphs and Symmetric Associations. Kluwer Academic Publishers, 1998.

[91] M. Riley and A. Ljolje. Automatic Speech and Speaker Recognition, chapter Automatic generation of detailed pronunciation lexicons. Kluwer Academic Publishers, Boston, 1996.

[92] J. Rissanen. Stochastic complexity (with discussions). Journal of the Royal Statistical Society, 49:223–239, 252–265, 1987.


[93] M. Saraclar and S. Khudanpur. Properties of pronunciation change in conversational speech recognition. In Proc. NIST Speech Transcription Workshop, College Park, MD, 2000.

[94] L.K. Saul, T. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. JAIR, 4:61–76, 1996.

[95] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 1978.

[96] R.D. Shachter. Bayes-ball: The rational pastime for determining irrelevance and requisite information in belief networks and influence diagrams. In Uncertainty in Artificial Intelligence, 1998.

[97] S. Sivadas and H. Hermansky. Hierarchical tandem feature extraction. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002.

[98] P. Smyth, D. Heckerman, and M.I. Jordan. Probabilistic independence networks for hidden Markov probability models. Technical Report A.I. Memo No. 1565, C.B.C.L. Memo No. 132, MIT AI Lab and CBCL, 1996.

[99] CMU Sphinx: Open source speech recognition. http://www.speech.cs.cmu.edu/sphinx/Sphinx.html.

[100] H. Tong. Non-linear Time Series: A Dynamical System Approach. Oxford Statistical Science Series 6. Oxford University Press, 1990.

[101] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[102] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1990.

[103] T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1992.

[104] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, submitted.

[105] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley and Sons Ltd., 1990.

[106] J.G. Wilpon, C.-H. Lee, and L.R. Rabiner. Improvements in connected digit recognition using higher order spectral and energy features. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991.

[107] P.C. Woodland and D. Povey. Large scale discriminative training for speech recognition. In ISCA ITRW ASR2000, 2000.

[108] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. The HTK Book. Entropic Labs and Cambridge University, 2.1 edition, 1990s.

[109] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998.

[110] G. Zweig and M. Padmanabhan. Exact alpha-beta computation in logarithmic space with application to MAP word graph construction. In Int. Conf. on Spoken Language Processing, 2000.

[111] G. Zweig and S. Russell. Probabilistic modeling with Bayesian networks for automatic speech recognition. Australian Journal of Intelligent Information Processing, 5(4):253–260, 1999.


Figure 68: The JHU WS01 GM Team Graph, showing the WS-2001 team members and their affiliations (University of Washington, IBM Research, DOD, Johns Hopkins University, MIT, SpeechWorks, Harvard University, Stanford University, Phonetact Inc.) arranged around a small graph over nodes X, Y, Z, and C.
