
Synchronous HMMs for Audio-Visual

Speech Processing

by

David Dean, BEng (Hons), BIT

PhD Thesis

Submitted in Fulfilment

of the Requirements

for the Degree of

Doctor of Philosophy

at the

Queensland University of Technology

Faculty of Engineering

July 2008


Keywords

Speech processing, speech recognition, speaker recognition, speaker verification, multi-modal, audio-visual, data fusion, pattern recognition, hidden Markov models, synchronous hidden Markov models


Abstract

Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs).

The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone.


This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques for normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.


Contents

Keywords
Abstract
List of Tables
List of Figures
Commonly used Abbreviations
Certification of Thesis
Acknowledgements

Chapter 1 Introduction
1.1 Motivation and overview
1.2 Aims and objectives
1.3 Outline of thesis
1.4 Original contributions of thesis
1.5 Publications resulting from research
1.5.1 International journal publications
1.5.2 International conference publications

Chapter 2 Audio-Visual Speech Processing
2.1 Introduction
2.2 Audio-visual speech processing by humans
2.2.1 The speech chain
2.2.2 Speech production
2.2.3 Phonemes and visemes
2.2.4 Audio-visual speech perception
2.2.5 Audio-visual speaker perception
2.3 Automatic audio-visual speech processing
2.3.1 Audio-visual speech recognition
2.3.2 Audio-visual speaker recognition
2.3.3 Comparing speech and speaker recognition
2.4 Audio-visual databases
2.4.1 A brief review of audio-visual databases
2.4.2 The XM2VTS database
2.5 Chapter summary

Chapter 3 Speech and Speaker Classification
3.1 Introduction
3.2 Background
3.2.1 Bayes classifier
3.2.2 Non-parametric classifiers
3.2.3 Parametric classifiers
3.3 Gaussian mixture models
3.3.1 GMM complexity
3.3.2 GMM parameter estimation
3.4 Hidden Markov models
3.4.1 Markov models
3.4.2 Hidden Markov models
3.4.3 Viterbi decoding algorithm
3.4.4 HMM parameter estimation
3.4.5 HMM types
3.5 Speaker adaptation
3.5.1 MAP adaptation
3.6 Chapter summary

Chapter 4 Speech and Speaker Recognition Framework
4.1 Introduction
4.2 Speech recognition
4.2.1 Speaker dependency
4.2.2 Speech decoding
4.3 Speaker recognition
4.3.1 Text dependency
4.3.2 Background adaptation
4.3.3 Evaluating speaker recognition performance
4.4 Speech processing framework†
4.4.1 Training and testing datasets
4.4.2 Background training
4.4.3 Speaker adaptation
4.4.4 Speech recognition
4.4.5 Speaker verification
4.5 Acoustic and visual conditions
4.6 Chapter summary

Chapter 5 Feature Extraction
5.1 Introduction
5.2 Acoustic feature extraction
5.2.1 Introduction
5.2.2 Pre-processing
5.2.3 Filter bank analysis
5.2.4 Mel frequency cepstral coefficients
5.2.5 Perceptual linear prediction
5.2.6 Energy and time derivative features
5.3 Visual front-end
5.3.1 The front-end effect
5.3.2 A brief review of visual front-ends
5.3.3 Manual front-end implementation
5.4 Visual features
5.4.1 Appearance based
5.4.2 Contour based
5.4.3 Combination
5.4.4 Choosing a visual feature extraction method
5.5 Dynamic visual speech features
5.5.1 Background
5.5.2 Cascading appearance-based features
5.6 Comparing speech and speaker recognition
5.6.1 Feature extraction
5.6.2 Model training and tuning
5.7 Speech recognition experiments
5.7.1 Results
5.7.2 Discussion
5.8 Speaker verification experiments†
5.8.1 Results
5.8.2 Discussion
5.9 Speech and speaker discussion
5.10 Chapter summary

Chapter 6 Simple Integration Strategies
6.1 Introduction
6.2 Integration strategies
6.3 Early integration
6.3.1 Introduction
6.3.2 Concatenative feature fusion
6.3.3 Discriminative feature fusion
6.4 Late integration
6.4.1 Introduction
6.4.2 Output score fusion for speaker verification
6.4.3 Score-normalisation
6.4.4 Modality weighting
6.5 Speech recognition experiments
6.5.1 Results
6.5.2 Discussion
6.6 Speaker verification experiments
6.6.1 Results
6.6.2 Discussion
6.7 Speech and speaker discussion
6.8 Chapter summary

Chapter 7 Synchronous HMMs
7.1 Introduction
7.2 Multi-stream HMMs
7.3 Synchronous HMMs
7.3.1 Introduction
7.3.2 SHMM joint-training
7.4 Weighting of synchronous HMMs†
7.4.1 Introduction
7.4.2 Results
7.4.3 Discussion
7.5 Normalisation of synchronous HMMs†
7.5.1 Introduction
7.5.2 Determining normalisation parameters
7.5.3 Results
7.5.4 Discussion
7.6 Speech recognition experiments†
7.6.1 Choosing the stream weight parameters
7.6.2 Results
7.7 Discussion
7.8 Chapter summary

Chapter 8 Fused HMM-Adaptation of Synchronous HMMs
8.1 Introduction
8.2 Discrete fused HMMs
8.2.1 Introduction
8.2.2 Maximising mutual information for audio-visual speech
8.2.3 Discrete implementation
8.3 Fused HMM adaptation of synchronous HMMs†
8.3.1 Continuous FHMMs
8.3.2 Fused-HMM adaptation
8.4 Biasing of FHMM-adapted SHMMs†
8.4.1 Introduction
8.4.2 Acoustic or visual biased
8.4.3 Discussion
8.5 Speech recognition experiments†
8.5.1 Results
8.5.2 Discussion
8.6 Speaker verification experiments†
8.6.1 Introduction
8.6.2 Stream weighting
8.6.3 Results
8.6.4 Discussion
8.7 Chapter summary

Chapter 9 Conclusions and Future Work
9.1 Conclusions
9.2 Future work

Bibliography


List of Tables

4.1 Configurations of the XM2VTS clients possible under this framework.

5.1 HMM topologies used for the uni-modal speech processing experiments.

5.2 WERs for speech recognition on all 12 configurations of the XM2VTS database.

7.1 Normalisation parameters determined from the per-frame evaluation score distributions.

7.2 Final weighting parameter αfinal calculated from the intended weighting parameter αtest using the normalisation parameter αnorm = 0.751.


List of Figures

2.1 Schematic diagram of human speech communication, considering only the auditory systems (adapted from [153]).

2.2 Sagittal section of the human speech production system (public domain, from [191]).

2.3 Some examples of raw frame images from the XM2VTS database [119].

2.4 Configurations for person recognition defined by the XM2VTS protocol [107].

3.1 A Markov process can be modelled as a state machine with probabilistic transitions (aij) between states at discrete intervals of time (t = 1, 2, ...).

3.2 A diagrammatic representation of a typical left-to-right HMM for speech processing.

4.1 A typical speech recognition system, outlining both the training of speech models and testing using these models.

4.2 Speaker-dependent speech recognition can be impractical for some applications.

4.3 An example of a possible voice-dialling speech grammar for continuous speech recognition. Adapted from [194].

4.4 A typical automatic speaker recognition system, outlining both the training of speaker models and testing using these models.

4.5 An example of a DET plot comparing two systems for speaker verification.

4.6 Overview of the speech processing framework used in this thesis.

4.7 Word recognition grammar used in this framework.

5.1 Configuration of an acoustic feature vector including the static (ci) and energy (E) coefficients and their corresponding delta and acceleration coefficients.

5.2 The visual feature extraction process, highlighting the visual front-end, encompassing the localisation, tracking and normalisation of the lip ROI.

5.3 Manual tracking was performed by recording the eye and lip locations every 50 frames and interpolating between.

5.4 Some examples of the original and grey-scaled resized ROIs extracted from the XM2VTS database.

5.5 Contour-based feature extractions used the geometry of the lip region as the basis of feature extraction.

5.6 Overview of the dynamic visual feature extraction system used for this thesis.

5.7 Most of the energy of a 2D-DCT resides in the lower-order coefficients, and can be collected easily using a zig-zag pattern.

5.8 Text-dependent speaker verification performance on all 12 configurations of the XM2VTS database.

6.1 Overview of the feature fusion systems used for this thesis, covering both concatenative and discriminative feature fusion.

6.2 Overview of the output score fusion approach used for this thesis.

6.3 Histograms of speaker verification scores (a) before and (b) after normalisation.

6.4 Performance of weighted output score fusion for speaker verification as α is varied from 0 to 1.

6.5 Speaker-independent feature-fusion speech recognition performance averaged over all 12 configurations of the XM2VTS database.

6.6 Speaker-dependent feature-fusion speech recognition performance averaged over all 12 configurations of the XM2VTS database.

6.7 Simple integration strategies for text-dependent speaker verification over noisy acoustic conditions.

7.1 Various multi-stream HMM modelling techniques used for AVSP in comparison to the uni-modal HMM (a). Acoustic emission densities are shown in blue and visual in red.

7.2 Speech recognition performance using SHMMs as αtest is varied. Each point represents a different αtrain and the line is the average of all αtrains for each αtest.

7.3 Speech recognition performance using SHMMs as αtrain is varied. αtest is chosen based on the best average performance in Figure 7.2.

7.4 Distribution of per-frame scores for individual A-PLP audio and video state-models within the SHMM under different types of normalisation.

7.5 Speech recognition performance under normalisation.

7.6 Speaker-independent speech recognition using full-normalised word-model SHMMs as αtest is varied on the first configuration of the XM2VTS database.

7.7 Speaker-independent speech recognition performance using SHMMs over all 12 configurations of the XM2VTS database.

8.1 By replacing the discrete secondary representations with continuous representations in Pan et al.'s [130] original FHMM, it can be seen that a SHMM will be created.

8.2 Performance of acoustic and visual biased FHMM-adapted SHMMs as testing stream weights are varied.

8.3 Speaker-independent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.

8.4 Speaker-dependent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.

8.5 Comparing the A-PLP biased FHMM-adapted SHMM with an equivalent jointly-trained SHMM on the first configuration of the XM2VTS database.

8.6 Tuning the testing stream weight parameter αtest for speaker verification using FHMM-adapted SHMMs.

8.7 Text-dependent speaker recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.


Commonly used Abbreviations

AVICAR Audio-visual Speech Corpus in a Car Environment (database)

AVSP Audio-visual speech processing

AVSPR Audio-visual speaker recognition

AVSR Audio-visual speech recognition

CUAVE Clemson University Audio Visual Experiments (database)

DCT Discrete cosine transform

DET Detection error tradeoff

EER Equal error rate

EM Expectation maximisation

FF Feature fusion

FHMM Fused HMM

GMM Gaussian mixture model

HCI Human-computer interface

HMM Hidden Markov model

HTK HMM Toolkit (software)

LDA Linear discriminant analysis


M2VTS MultiModal Verification for Teleservices and Security applications (database)

MAP Maximum a posteriori

MFCC Mel frequency cepstral coefficients

MRDCT Mean-removed DCT

PLP Perceptual linear predictive

ROI Region of interest

SD Speaker dependent

SHMM Synchronous HMM

SI Speaker independent

SNR Signal to noise ratio

TD Text dependent

TI Text independent

TIMIT An acoustic speech database developed by Texas Instruments (TI) and Massachusetts Institute of Technology (MIT)

WER Word error rate

XM2VTS Extended M2VTS (database)


Certification of Thesis

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:

Date:


Acknowledgements

Completing a PhD research programme is certainly one of the more interesting experiences I have had, and a lot of people have helped me along the way. While it is probably not possible to thank everyone (if only due to my poor memory), there are certain people who must be mentioned. Firstly and most importantly, I would like to thank my lovely wife Melly and (sometimes) lovely boys Axel and Henry for the support they provided, and especially for putting up with me as I experimented with weird working hours during the hectic write-up stage that produced this final document. In addition, I would like to thank my parents for the encouragement and support they have always provided me.

I would also like to thank my supervisory team, Sridha Sridharan, Vinod Chandran and Tim Wark, for providing valuable guidance and encouragement throughout the course of my study. I am particularly indebted to Sridha for the excellent research environment he has provided in the Speech, Audio, Image and Video Technologies (SAIVT) research laboratory at Queensland University of Technology (QUT), and the many opportunities I have had to present my research at both domestic and international conferences. I am also thankful for my regular meetings with Tim to discuss the direction of my research, and the help he has provided in nutting out the difficult little problems that came up along the way. It should also be mentioned that part of this PhD was supported by the Australian Research Council Grant No. LP0562101, and I am grateful for that support.


During my PhD I was fortunate to present my research at a number of significant speech processing conferences, and I am grateful for the opportunity that this presented for me to network with my fellow researchers from other institutions. I would like to thank Roland Goecke, Iain Matthews and Gerasimos Potamianos, and many others I cannot remember specifically (sorry), for listening to me and providing valuable feedback that significantly improved my research over what it would have been without their input.

Of course, the group who probably had the largest impact on my research are the past and present members of the SAIVT laboratory. In addition to the incredibly valuable research expertise embodied in my fellow colleagues, the great social atmosphere within the laboratory also made it a pleasure to work there. Particular thanks must go to my colleague Patrick Lucey for his help in sorting out problems in the field of audio-visual speech processing that we shared. Special mention must also go to Brendan Baker, Jamie Cook, Simon Denman, Ivan Drago, Clinton Fookes, Tristan Kleinschmidt, Frank Lin, Terrance Martin, Michael Mason, Chris McCool, Mitchel McLaren, Robbie Vogt, Roy Wallace and Eddie Wong, who all helped me in some way, at some point.

Finally, I would like to especially thank and acknowledge everybody whom I have forgotten above.


Chapter 1

Introduction

1.1 Motivation and overview

Automatic speech processing is a very mature area of research, and one that is playing an ever-increasing role in our day-to-day lives. While these systems have shown promise when performing well-defined tasks like dictation or call-centre navigation in reasonably clean and controlled environments, they have not yet reached the stage where they can be fully deployed in real-world situations. The major reason behind this is the susceptibility that audio speech recognition systems have to environmental noise, which can degrade performance by many orders of magnitude.

However, speech does not consist of the audio modality alone, and studies of human production and perception of speech have shown that the visual movement of the speaker's face and lips is an important factor in human communication.

Fortunately, many of the sources of audio degradation can be considered to have little effect on the visual signal, and a similar assumption can also be drawn about many sources of video degradation. By taking advantage of the complementary nature of audio-visual speech, combining both modalities together will increase the robustness to independent sources of degradation in either modality. This is the motivation behind audio-visual speech processing (AVSP).

In AVSP, the method chosen for combining the two sources of speech information remains a major area of ongoing research. Early AVSP systems could generally be divided into two main groups, early or late integration, based on whether the two modalities were combined before or after classification/scoring. Late integration had the advantage that the reliability of each modality's classifier could be weighted easily before combination, but was difficult to use on anything but isolated word recognition due to the problem of aligning and fusing two possibly significantly different speech transcriptions. This was not a problem with early integration, where the features are combined before classification by a single classifier, but, on the other hand, it is very difficult to model the reliability of each modality in such a system.
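As a purely illustrative sketch of the weighted late-integration idea (the specific score-fusion formulation used in this thesis is developed in Chapter 6), if sA and sV denote the acoustic and visual classifier scores for a given hypothesis, a reliability-weighted fused score can be formed as

sAV = α sA + (1 − α) sV,   0 ≤ α ≤ 1,

where the weight α reflects the estimated reliability of the acoustic modality, and would typically be lowered as the level of acoustic noise rises.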

To allow a compromise between these two extremes, middle integration schemes were developed that allow classifier scores to be combined in a weighted manner within the structure of the classifier itself. The simplest of the middle integration methods, and the subject of this thesis, is the synchronous HMM (SHMM). There are more complicated middle integration designs, primarily intended to allow modelling of the asynchronous nature of audio-visual speech, such as asynchronous, product or coupled HMMs. However, while these models do show a performance increase over SHMMs, the increase is not large and may not be worth the additional complexity in the training and testing of the asynchronous models. It is the simplicity of the SHMM that encourages further research into improving speech recognition performance whilst staying within the synchronous design pattern.
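To make the idea of combining classifier scores within the classifier concrete, a hedged sketch of the standard multi-stream emission density that underlies the SHMM (the precise definitions and weighting conventions used in this thesis are given in Chapter 7): each state j scores the synchronous acoustic and visual observations as

b_j(o_t^a, o_t^v) = [b_j^a(o_t^a)]^α [b_j^v(o_t^v)]^(1−α),

so that, in the log domain used during Viterbi decoding, the per-stream log-likelihoods are simply added with weights α and 1 − α at every frame.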

This thesis will focus on investigating the suitability of the SHMM structure for audio-visual speech and speaker recognition, in comparison to the baseline performance provided by uni-modal speech modelling as well as early and late integration strategies. In the process of investigating the SHMM approach, a number of novel training and testing techniques relating to the use of SHMMs for audio-visual speech modelling will be developed. Particular attention will be paid to the novel fused HMM (FHMM) adaptation process, which will be shown to produce a SHMM that can outperform SHMMs trained using the existing state-of-the-art joint-training method at all levels of acoustic noise.

1.2 Aims and objectives

It follows from Section 1.1 that the broad aims of this thesis can be summarised as follows:

1. To investigate the suitability of existing feature extraction and integration techniques for both speech and speaker recognition.

2. To study and develop techniques to improve the audio-visual speech modelling ability of SHMMs trained using the state-of-the-art joint-training process.

3. To develop an alternative training technique for SHMMs that can improve the audio-visual speech modelling ability in comparison to the existing state-of-the-art joint-training process.

4. To compare and contrast the suitability of SHMMs for speech and speaker recognition in comparison to existing baseline integration techniques.

More specifically, the objectives of this research programme are:

1. To review existing knowledge and techniques relevant to both speech and speaker recognition using the audio and visual modalities.

2. To create a speech processing framework that can be used to evaluate both speech and speaker recognition techniques, encouraging the re-use of models and techniques between the two speech processing tasks where appropriate.

3. To investigate the state of the art in acoustic and visual feature extraction techniques for audio-visual speech processing, and compare the suitability of these features between the two speech processing tasks.

4. To review and investigate simple integration techniques that can be applied through fusion before or after uni-modal classification to serve as a baseline for middle integration experiments.

5. To review middle integration methods for audio-visual speech processing, with a particular focus on SHMMs due to their simplicity in comparison to other middle integration approaches.

6. To investigate the behaviour of jointly-trained SHMMs during the training and testing of a speech processing system, and to develop techniques to improve the speech modelling ability within the existing training techniques.

7. To develop methods of improving SHMM performance through FHMM-adaptation to improve the audio-visual speech modelling ability over the existing jointly-trained SHMMs.

1.3 Outline of thesis

The remainder of this thesis is organised as follows:

Chapter 2 gives an overview of the broad area of audio-visual speech processing, covering both speech production and the audio-visual perception of speech and speakers by both humans and machines. A brief review of suitable audio-visual speech processing databases is also conducted in this chapter.

Chapter 3 introduces the theory behind data classification, as well as outlining the classification techniques in common use for automatic speech processing. Gaussian mixture models are introduced as static speech classification models, and are extended into hidden Markov models for the temporal modelling of speech events. Finally, maximum a posteriori speaker adaptation using these modelling techniques is introduced to allow speaker-dependent models to be generated from well-trained background models.

Chapter 4 provides a detailed overview of automatic speech and speaker recognition, covering the methods and techniques that are involved in both exercises. The chapter concludes with a novel framework based on the XM2VTS database that can be used to test both speech and speaker recognition within a single training process.

Chapter 5 looks at audio and video feature extraction techniques that have demonstrated suitability for speech processing applications, and concludes with a comparison of visual features at various stages of a dynamic feature extraction cascade for both the speech and speaker verification applications. Early in this chapter, a review of both acoustic and visual feature extraction techniques is conducted, with a particular focus on visual feature extraction. After a brief review of the visual front-end, both appearance-based and geometric-based visual feature extraction techniques are reviewed. Within appearance-based feature extraction, a number of dynamic feature extraction techniques are outlined that are designed to extract the most relevant speech features from a given ROI. In the experimental section of this chapter, a number of visual and acoustic speech features are compared to determine the suitability of dynamic visual speech features for both speech and speaker recognition.

Chapter 6 investigates simple methods of fusing the acoustic and visual modalities that can be used with the existing classification techniques already developed for uni-modal speech processing. Early integration techniques are investigated for speech and speaker recognition, while late integration is only considered for speaker recognition due to the difficulty of combining speech transcriptions in an output fusion configuration.

Chapter 7 reviews middle integration approaches to audio-visual speech processing in the literature, with particular attention paid to the simplest of the middle integration methods for AVSP, the SHMM. An investigation of the SHMM structure is conducted to determine the effect that each modality has on the final speech recognition performance based upon how each stream is weighted during the training and decoding of the structure. Additionally, a number of novel classifier normalisation techniques are investigated within the SHMM structure to improve the robustness of the SHMM to acoustic noise.

Chapter 8 introduces an alternative training technique for SHMMs that provides improved audio-visual speech modelling ability when compared to the existing state-of-the-art training techniques for SHMMs. Experiments are conducted with the resulting FHMM-adapted SHMMs to compare and contrast this SHMM training technique against the earlier fusion methods for both speech and speaker recognition.

Chapter 9 summarises the work presented in this thesis, and presents the main conclusions that have been drawn from it. This chapter also suggests future work that may be undertaken to improve upon the research conducted in this thesis.

1.4 Original contributions of thesis

The work presented in this thesis makes original contributions in a number of different areas, summarised as follows (sections throughout this thesis which contain significant original work are indicated with the “†” symbol):

1. A novel framework for evaluating both speech and speaker recognition whilst reusing the same speech models for both tasks is presented in Chapter 4.

2. A comparison of appearance-based static and dynamic visual speech features is conducted for visual speaker verification in Chapter 5 to show that visual speaker verification improves as more dynamic information is extracted from the ROI.

3. A study of the effect of varying the stream weights independently during the training and testing of SHMMs is conducted in Chapter 7 to show that the choice of stream weight during training has only a minor effect on the final speech processing ability of the SHMM.

4. A novel adaptation of zero normalisation is applied within the states of a SHMM in Chapter 7 to normalise the video scores to a similar range to the audio, allowing the final SHMM to be more robust to acoustic noise (a brief sketch of this normalisation appears after this list).

5. An additional variance-only normalisation technique is developed in Chapter 7 to allow stream normalisation to occur within SHMMs solely through the use of the stream weighting parameters, rather than requiring access within the Viterbi process to apply full mean and variance normalisation.

6. The novel FHMM-adaptation method of training a SHMM from a uni-modal acoustic or visual HMM, through the addition of separately trained GMMs for the secondary modality, is developed in Chapter 8 to show improved audio-visual speech modelling ability over existing SHMM training techniques.
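As a hedged sketch of the zero-normalisation idea behind contributions 4 and 5 (the actual normalisation parameters and their estimation are described in Chapter 7), the per-frame log-score log b_j^v(o_t^v) of a visual state model is replaced by its zero-normalised value

(log b_j^v(o_t^v) − μv) / σv,

where μv and σv are a mean and standard deviation estimated from held-out per-frame score distributions, bringing the visual scores into a range comparable to the acoustic scores before they are weighted and combined within each SHMM state. The variance-only variant of contribution 5 can be read as dropping the mean term, so that the remaining 1/σv scaling can be absorbed directly into the stream weighting parameters.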

1.5 Publications resulting from research

The following fully-refereed publications have been produced as a result of the work in this thesis:

1.5.1 International journal publications

1. D. Dean and S. Sridharan, “Dynamic visual features for audio-visual speaker verification,” Computer Speech and Language (submitted).

2. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of synchronous HMMs for audio-visual speech recognition,” Digital Signal Processing (submitted).

1.5.2 International conference publications

1. D. Dean and S. Sridharan, “Fused HMM adaptation of synchronous HMMs for audio-visual speaker verification,” in Auditory-Visual Speech Processing (accepted), 2008.

2. D. Dean, S. Sridharan, and P. Lucey, “Cascading appearance based features for visual speaker verification,” in Interspeech 2008 (accepted), 2008.

3. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Weighting and normalisation of synchronous HMMs for audio-visual speech recognition,” in Auditory-Visual Speech Processing, Hilvarenbeek, The Netherlands, September 2007, pp. 110–115.

4. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of multi-stream HMMs for audio-visual speech recognition,” in Interspeech, Antwerp, August 2007, pp. 666–669.

5. T. Kleinschmidt, D. Dean, S. Sridharan, and M. Mason, “A continuous speech recognition evaluation protocol for the AVICAR database,” in International Conference on Signal Processing and Communication Systems (ICSPCS) (accepted), 2007.

6. D. Dean, S. Sridharan, and T. Wark, “Audio-visual speaker verification using continuous fused HMMs,” in HCSNet Workshop on the Use of Vision in HCI (VisHCI), 2006.

7. D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused HMMs for speaker recognition,” in Second Workshop on Multimodal User Authentication (MMUA), Toulouse, France, 2006.

8. D. Dean, P. Lucey, and S. Sridharan, “Audio-visual speaker identification using the CUAVE database,” in Auditory-Visual Speech Processing (AVSP), British Columbia, Canada, July 24-27 2005, pp. 97–101.

9. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Comparing audio and visual information for speech processing,” in Eighth International Symposium on Signal Processing and Its Applications (ISSPA), Sydney, Australia, 2005, pp. 58–61.

10. P. Lucey, D. Dean, and S. Sridharan, “Problems associated with area-based visual speech feature extraction,” in Auditory-Visual Speech Processing (AVSP), British Columbia, Canada, 2005, pp. 73–78.


Chapter 2

Audio-Visual Speech Processing

2.1 Introduction

Speech is clearly one of, if not the most important communication methods available between humans, and it is the primacy of this medium that motivates research efforts to allow speech to become a viable human-computer interface (HCI). By allowing computers to recognise both speech and the identities of speakers, the interface can be more direct, with no need for an additional format conversion (i.e., typing) in the communications chain. These two main areas of research, using computers to recognise speech and to recognise the identities of speakers, are collectively referred to as automatic speech processing.

Human speech is transmitted between speakers both through the acoustic speech wave and the visual movement of the lips, and while it may not be immediately obvious, useful information is contained in both of these modalities. The widespread adoption of telephones, radios and other audio-based technology clearly shows that speech can be understood by humans with high accuracy using audio alone in good conditions. However, when visual information is available, psychological studies have shown that this information can and does improve speech perception. Similarly, incorrect or mistimed visual information can be jarring to users and even cause mistakes in perception in extreme cases.

This chapter will review existing research in the field of audio-visual speech processing, covering both human perception studies and systems designed to recognise speech or speakers automatically. Reviews of the existing literature in both human and machine-based speech processing will be conducted to demonstrate the significant improvements that can be realised by including visual speech information alongside traditional acoustic speech processing.

2.2 Audio-visual speech processing by humans

Human speech is a complicated physiological process, with many components coming into play for both the production and perception of speech events. However, as spoken language is one of the primary characteristics that made humans what they are [101], the physiological basis of human speech has been studied in extensive detail, and is reasonably well understood.

In this section a review of the human perception literature will be conducted to explore the physiological processes involved in human speech production, speech perception and the recognition of speakers (speaker perception). As audio-visual speech is the focus of this research programme, particular attention will be paid to the impact that the visual modality has on these physiological processes.

2.2.1 The speech chain

At the highest level, the process of human-to-human communication can be seen as the imperfect transmission of an idea from one mind into another. This idea of a speech chain [63, 153] encompasses both speech production and perception, as well as the transmission channels between the two participants. In face-to-face communication, this channel would simply be the acoustic sound wave and the reflected light from the speaker's mouth region. However, the transmission channel can easily become more complicated if, for example, a telephone or video transmission device were introduced to allow communication at a distance.

Figure 2.1: Schematic diagram of human speech communication, considering only the auditory systems (adapted from [153]).

An example of such a chain, considering acoustic speech only, is shown in Figure 2.1 [153]. It can be seen that the idea is first converted into a language-based representation. This is further translated into the signals necessary to control the lungs and vocal tract (consisting of the vocal cords and mouth region), which finally generate the acoustic wave for transmission. Once the signal reaches the listener, the movements of the ear drum are converted back into nerve signals, then into a language representation, finally, hopefully, converging on the idea intended by the speaker.

Including visual speech in this model has little effect on the speaker's end of the chain, where the visual aspect of speech can largely be considered a side effect [105], but the listener must additionally be cognisant of the visual information, which is then converted to nerve signals by the retina. At this point in the speech perception process the two nerve signals (hearing and vision) are fused within the brain to arrive at a language model and, finally, the idea.

2.2.2 Speech production

Human speech is an acoustic waveform that travels in the form of sound pressure

changes through the air. This pressure wave is generated by transforming the original

expulsion of air from the lungs through the vocal folds and articulators within the vocal

tract. This term refers to the portion of the speech production system that transforms

the lung’s expelled air into recognisable human speech, and consists of the larynx,

vocal folds, pharynx and the oral and nasal cavities, shown in Figure 2.2.

The sounds produced within the vocal tract can be classified according to the actions of a number of components within the vocal tract. Upon leaving the lungs through the trachea, the airstream enters the larynx and encounters the vocal folds, which can either be tightened or relaxed. If tightened, the vocal folds interfere with and vibrate against the airflow, and the resulting sound is said to be voiced. Correspondingly, if the vocal folds are relaxed they do not vibrate and the sound is unvoiced. The airstream then enters the pharynx, to be directed into either both the oral and nasal cavities, or just the oral cavity if the soft palate is closed. If the sound is produced with only the oral cavity it is referred to as oral, or if the nasal cavity is also used, nasal.

Finally, the sounds produced can be further classified according to their place and manner of articulation. In speech, articulation is the process by which the tongue or lips make contact with other portions of the oral cavity to form specific speech sounds. The manner of articulation can vary from approximant, where there is very little obstruction of the airflow, to fricative, where the obstruction is enough to cause turbulence, and finally to a stop, where the articulators involved completely obstruct the airflow. The place of articulation refers to which articulators are included in the speech event, which generally will be either the tongue or the lips and another portion of the oral cavity such as the teeth, alveolar ridge, or the soft or hard palate [93].


Figure 2.2: Sagittal section of the human speech production system (public domain, from [191]).


While body language, including facial expressions, is (at least) subconsciously intended to communicate, the visual movement of the speech articulators appears primarily to be a side effect of the shaping of the acoustic speech, and not an intentional method of visual communication. However, as humans have clearly adapted to make use of this visual information, as will be shown later in this chapter, the study of how the visible speech is related to the acoustic speech is important. Of the large number of components involved in human speech, it can be seen from Figure 2.2 that, even in the best conditions, only a subset of the articulators are visible, being the lips, teeth and tongue, and only the lips are visible in an unobscured manner.

2.2.3 Phonemes and visemes

In traditional acoustic speech processing tasks, phonemes are the smallest units of speech that can be distinguished linguistically. Two phonemes can be considered linguistically distinct if two words can be found that differ only in the two phonemes, forming a minimal pair. An example of this would be using ‘pat’ and ‘bat’ to demonstrate that /p/ and /b/ are distinct phonemes. It is difficult to establish an exhaustive set of phonemes, particularly if multiple languages are considered, but the International Phonetic Alphabet (IPA) [76] is generally considered to be the standard list. Of the 107 distinct phonemes in the IPA, only around 50 are commonly used in English [153].

Visemes are generally considered to be the equivalent of phonemes in the visual domain, although they are not defined by linguistic distinctiveness, but rather by visual distinction [111]. Because the variety of acoustic speech events is not completely represented by the visible articulators, each viseme generally corresponds to many visually similar but linguistically distinct phonemes. No real consensus exists on the number and grouping of visemes, but generally there are considered to be on the order of 10-20 visemes [23], as compared to the 50 or so phonemes in common English usage.


2.2.4 Audio-visual speech perception

Human speech perception is commonly assumed to primarily be an acoustic pro-

cess [153], and humans can certainly understand speech easily when only the audio

is available, such as in telephone-based communication. For the case of visual-only

speech, the ability of the hard-of-hearing to lip-read well enough to take part in regular

conversations demonstrates that there is sufficient visual information to understand

speech, although context plays a larger part than in auditory listening [171].

However, studies of human speech perception have shown that it is not just the hard-of-hearing who make use of visual information to aid in speech perception. The earliest

such study was performed by Sumby and Pollack in 1954 [170], where they looked

at the effects of auditory noise on human speech perception with and without visual

information being available to the listener. From their experiments they found that

allowing the participants to see the lip movements provided a speech perception in-

crease equivalent to raising the auditory signal-to-noise ratio by up to 15 dB. More

recently, Reisberg et al. [156] showed that listeners with normal hearing ability still

make enough use of the visual information to show improvement in speech recogni-

tion performance even in clearly articulated speech.

Visual speech can be considered useful for human speech perception in two main ways. Firstly, it is useful for directing the listener’s attention to the speaker, and secondly the visual speech can provide complementary information to the acoustic. In the first case, the visual speech can be used to allow a listener to determine who is talking, where they are, and even when they are actively speaking. By allowing a listener to focus on the speaker, and even to take advantage of the lip movements to filter many simultaneous voices, such as might be encountered at a noisy party, the visual speech can be considered a speech enhancement stage prior to the actual acoustic speech perception.

Secondly, the complementary nature of audio and visual speech can be shown by studying the confusability of speech events in either modality. Summerfield [172] has shown that many of the easily confused phonemes have distinct visemes that can be easily distinguished provided the lip region is clearly visible. An example of this would be /f/ and /th/, which can be difficult to distinguish acoustically, but can easily be distinguished visually based on whether the lower lip or the tongue is against the teeth. Summerfield also showed that the converse is true: the phonemes corresponding to a particular viseme are often acoustically quite different, as demonstrated by /t/ and /d/, which share the same viseme but can be distinguished easily acoustically because /d/ is voiced and /t/ is not.

Some of the more powerful indicators of the impact of visual speech become apparent when visual and acoustic speech are combined incorrectly. For example, if a listener is presented with differing audio and visual cues simultaneously, a third sound can be perceived rather than either of the two actually ‘said’ in either modality. This is referred to as the McGurk effect [117], of which the commonly given example is that a listener seeing ‘ga’ but hearing ‘ba’ will believe they are hearing ‘da’, rather than either of the actual spoken syllables. This effect is also likely related to the jarring effect that can make badly dubbed movies difficult to watch.

2.2.5 Audio-visual speaker perception

Audio-visual speaker perception is used here to refer to the process by which humans make use of both the acoustic and visual modalities to successfully recognise the identity of a speaking person. However, this is not a very active area of research, with most human-person-recognition research focusing on recognition from audio

speech [9] or facial images [199] in isolation. Of these two approaches, facial images

have been shown to be faster and more efficient in human recognition studies when

compared to voice recognition [50, 166].

While human person recognition is dominated by face recognition, recognition of familiar voices is still quite powerful, as evidenced by recognition over telephone lines or radio where the visual modality is not available. The study of human recognition by voice is a relatively new area of research compared to face recognition and acoustic speech recognition, but voice-recognition responses have been found in the acoustic processing areas of the brain similar to those of face recognition in the visual cortex, suggesting that acoustic recognition operates in a similar manner [9]. Recently, von Kriegstein et al. [180] have shown that recognition of familiar voices also activates a region of the brain normally associated with face recognition, even in the absence of a visible face, suggesting that some processing may be shared between the two modalities.

Studies of the human perception of faces have generally focused on the recognition of static faces [199], and it has been found that hair, the face outline, the eyes and the mouth are all very important for human face recognition [167, 17], but that the top half of the face as a whole is more important than the lower [167]. However, recent studies into the recognition of moving faces have demonstrated an improvement in person recognition over static face images [128, 139, 160], particularly if the movements are within the face region, such as speech or expression variation, rather than movements of the face as a whole [94]. Indeed, Knappmeyer et al. [91] have demonstrated that a novel face can be easily confused with a similar-looking familiar face if the characteristic movements of the familiar face are transferred to the novel one, suggesting that the characteristic movement of a person’s face is an important factor in person recognition.

While these studies have shown that recognition improved in the presence of a speak-

ing face, the only stimulus presented to the participants was the moving video, and

no complementary acoustic stimulus was supplied alongside to study the effect of

having both modalities available on human person recognition. One early study of

cross-modality person recognition was based on priming studies, where a priming

stimulus is used to influence a future target recognition. In this study, performed in

1997 by Schweinberger et al. [166], it was shown that a face prime appeared to fa-

cilitate improved recognition of a celebrity’s voice, even with a long (30 minutes of

other stimuli) interval between the prime and target. A similar effect was not found

by priming with the name alone, suggesting that there was a perceptual rather than


semantic effect present.

Only recently have perception studies been performed that looked at the combination

of both modalities and their effect on person recognition. The first study in this area

was by Kamichi et al. [85] in 2003, who found that participants could match unfamil-

iar faces and voices above the level of chance, suggesting that the movement of the

face and the acoustic signals are correlated in a manner that can be recognised by the

participants. Based on these results, Kamichi et al. suggested that the movements

of the face during speech contain dynamic information about speaker identity. However, this study did not directly investigate the recognition of familiar speakers using

audio-visual stimuli.

The only existing study in the literature of audio-visual person recognition by humans

was published by Schweinberger et al. in 2007 [165]. Attempting to rectify the lack of any multi-modal studies in the literature, they conducted an exhaustive comparison of

human recognition of people under 14 different conditions based on four underlying

variables:

• familiarity of the face,

• whether the face is dynamic or static,

• familiarity of the voice, and

• whether the voice presented matches the face

Participants were presented with a face and audio stimulus for around 2 seconds, and were asked to judge whether they were familiar with the faces presented or not. Participants were encouraged to submit their answer as quickly and accurately as possible, and their response time was recorded alongside their answer.

From the results of their experiments, Schweinberger et al. concluded that recognition of familiar voices was faster and more accurate when the matching face was shown, and that performance was degraded when an incorrect face was shown, compared to a baseline audio-only recognition. They found that the improvement in performance and speed was much larger for the dynamic faces, but was still present for the static faces. Additionally, it was found that when an unmatched face was presented against a familiar voice, it was easier to ignore if static, but caused significant degradation if dynamic. Similar trends were found in the results for the unfamiliar voices, but the overall results were not significantly better than the baseline audio performance.

However, while Schweinberger et al. compared the audio-visual recognition against

an audio-only baseline, they did not evaluate a similar video-only baseline, limiting

the conclusions that can be drawn from the research about the benefits of acoustic

information in addition to face-based person recognition. While Schweinberger et

al.’s [165] research and the improvement gained through dynamic versus static face

recognition [94, 91] demonstrate that human recognition of speakers improves as more

speech-related information is available, no conclusive research has yet shown that

acoustic information improves person recognition in the presence of dynamic visual

information, although it seems sensible that it should.

2.3 Automatic audio-visual speech processing

Because it was clear from human studies that the audio signal contains most of the speech information, most early research into speech-based human-computer interfaces (HCIs) was based around automatic acoustic speech processing. This area is very mature, and many commercial implementations are now deployed making use of both speech recognition and speaker recognition technologies in constrained conditions, such as with limited vocabularies or in well-controlled environments. One

pertinent example of commercially deployed speech recognition systems would be

the replacement of touch-tone phone menus with automatic speech prompts, where

the limited vocabulary reduces the difficulty of the speech recognition task significantly [31]. Commercial speaker recognition systems are not quite as widespread as

speech recognition, but they do have application in forensic and security work. An

example of such a system would be Hollien’s SAUSI (Semi-automatic Speaker Identi-

fication) system designed expressly for the use of forensic phoneticians [74].

One of the factors that is holding back widespread adoption of automatic speech pro-

cessing systems is the susceptibility of acoustic speech to environmental noise, which

can degrade performance by many orders of magnitude [66]. One obvious possibility for improving acoustic speech processing systems, and one that is clearly motivated by human perception studies, is to introduce visual information into existing audio speech processing systems. Because the visual information is complementary to the acoustic, this should improve system performance, particularly in the kinds of environments in which existing acoustic systems perform poorly. The introduction of visual information to acoustic speech processing systems leads to the research area of audio-visual speech processing (AVSP), covering the related areas of audio-visual speech and speaker recognition (AVSR and AVSPR).

2.3.1 Audio-visual speech recognition

Research into the automatic recognition of human speech has been ongoing since the end of World War II, with the rapid growth of military and civil radio for aviation and other purposes providing the motivation to ease the workload of radio operators. Two of the earliest attempts at automatic speech recognition were

conducted independently by Davis et al. [33] at MIT in 1952 and Olson and Belar [127]

at RCA Laboratories in 1956. Both of these efforts focused on recognising a limited vo-

cabulary of words using spectral measurements of the acoustic signal captured using

analog filter banks.

Between these early efforts and the late 1970s the field of automatic acoustic speech recognition developed considerably, with many more small-vocabulary systems being built, and the pioneering of many modern speech recognition techniques such as dynamic time warping [178] and linear predictive coding of acoustic features [77]. In the 1970s, early research into large vocabulary speech recognition was begun at IBM [81], and efforts towards truly speaker-independent speech recognition systems were begun by AT&T’s Bell Labs [152]. The main focus of the 1980s was on the recognition of continuous speech, rather than the isolated-word recognition that was the focus of earlier efforts, spearheaded by Carnegie Mellon University’s early work in the 1960s [155]. The continuous-speech focus was accompanied by a widespread shift from template-matching methods to statistical modelling methods, in particular the use of the hidden Markov model (HMM) to easily negotiate connected word and phone networks [153].

Alongside this maturing of the acoustic speech recognition field, and motivated by

human perception studies, the first automatic audio-visual speech recognition system

was developed by Petajan in 1984 [136]. Petajan’s system extracted geometric param-

eters (height, width, and perimeter) from black and white images of the speaker’s

mouth region and used dynamic time-warping and template matching to recognise

words using these features. Later research by Petajan et al. found that the binary

image data outperformed the geometric features [135].

Much of the feature extraction research for visual speech followed similar work in face recognition. This was evident in Bregler and Konig’s adaptation of eigenfaces [175] to create eigenlips features for AVSR [16] in the mid-90s, as well as the further extension of these features using linear discriminant analysis by Duchnowski et al. [47] to improve speech-discrimination performance. The modelling techniques used for visual speech recognition tended to follow those of acoustic speech recognition, with early systems focusing on template matching [136] and neural networks [16], but HMMs rose to prominence to become the de-facto standard [65, 121] around the mid-90s.

While the early research into AVSR mainly focused on recognising visual speech on

its own, the performance obtained using such a design could not match that of audio

alone. Given that human perception studies had shown that best recognition per-

formance could be obtained through a combination of both, research into fusing the


acoustic and visual information was of paramount importance to the development of

useful AVSR systems.

The earliest attempt at combining the two modalities was performed by Yuhas et al. in 1989 [195]. Yuhas et al.’s system used a neural network with the pixel values of the lip region as inputs to attempt to estimate the acoustic spectrum based upon the visual information. This estimated spectrum was then combined with the true acoustic spectrum, weighted manually according to the acoustic noise level. This combined spectrum was then fed into a regular acoustic vowel recogniser, and performance was found to improve upon the acoustic-only result.

This early effort was followed in the early 1990s by research on combining the acoustic and visual modalities using time-delay neural networks [169, 47], and then by many papers looking at various fusion techniques using the now-prominent HMMs as the basis of modelling [110, 168, 1, 140].

Most early efforts at audio-visual fusion developed in the 1990s focused on either

combining audio and video features before classification, or combining the results of

separate classifiers. These two approaches are referred to as early or late integration

respectively. Recent AVSR research has focused on modelling techniques that can be considered a compromise between these two approaches. Most of

these approaches focus on a variant of multi-stream HMMs, of which the simplest is

the synchronous HMM [145], the subject of this thesis. More complicated approaches

have been developed [125, 145, 12], intended mostly to deal with the asynchronous

nature of audio-visual speech, but their training and testing complexity has limited

their application for real-world use.

Reflecting the maturity of AVSR research in the last decade, a number of review papers have been solely focused on the topic, with the earliest by Chen and Rao in 1998 [24]; more recent research has been covered by Chibelushi et al. [25] in 2002 and Potamianos et al. [145] in 2003. Most recently, MIT Press has published an entire book solely devoted to AVSP research [177].


2.3.2 Audio-visual speaker recognition

The earliest work on acoustic speaker recognition came from the idea of forensic voice-

print identification from Sonographs, first studied by Kersta [88], based on earlier work

done during World War II by Ralph Potter and colleagues at Bell Laboratories [150].

Whilst Kersta’s paper is not entirely clear on the methodology [74], it appears that he

found that his fellow staff members could recognise a person by their Sonograph with

99% accuracy [88].

Research on true automatic speaker recognition started in the 1970s with Atal’s work

on text-dependent recognition based on Cepstral features [4], and techniques and

methods tended to be shared and follow along with speech recognition research [18].

Similar to speech recognition research, by the mid 1990s most speaker recognition re-

search had settled on using HMMs or GMMs to model Cepstral-based acoustic speech

features [190, 158].

The earliest effort in attempting to recognise a speaker by both the acoustic and vi-

sual modalities was performed by Wagner and Dieckmann in 1994 [181]. This system

used optical flow to represent the visual features and a frequency representation of

the audio in synergetic classifiers before combining the result. They found the motion

features to work better than the acoustic, but could not obtain an improvement through

fusion of both classifiers.

Luettin et al. [109] were the first researchers to use HMMs for text-dependent recognition of speakers from lip images, using contour-based feature extraction on the lip region. Contour-based features were popular at the time in AVSPR research [189, 27, 5], in part encouraged by the release of the DAVID audio-visual database [26], which had blue-highlighted lip images of speakers. Hybrid features, incorporating both contour and intensity information, were also investigated, showing improved performance over contour alone [84, 185]. Jourlin et al.’s paper [84] also showed the

first combination of acoustic and visual features within the HMM approach, which

served as the basis of much future AVSPR research.


Most avenues of research continued to focus on simple fusion techniques until early in the new century, when multi-stream HMMs were introduced for the AVSPR task by Wark et al. [188]. Research continued to grow into methods of handling both modalities simultaneously, in particular handling the asynchronous nature of audio and video

speech events, with the introduction of the coupled HMM [57] and the asynchronous

HMM [11] for AVSPR.

While AVSPR research is interested in recognising persons whilst they are speaking, there is still a significant amount of static face information available in most applications that can be used for traditional face recognition, in addition to the acoustic and visual speech features. Some recent examples of such hybrid systems are those developed by Fox et al. [53] and Nefian et al. [123]. Both of these systems have shown an improvement by bringing back the static face information that was discarded when only considering the mouth region for visual speaker recognition.

2.3.3 Comparing speech and speaker recognition

As the two fields of speech and speaker recognition are very closely related, there have

been a number of efforts in the literature to compare and contrast the two tasks under

similar conditions. In particular, two researchers in the field have published complete

theses covering both fields of research, Luettin in 1997 [111] and Lucey in 2002 [105],

that provide a good summary of the two fields at the time they were published.

While a number of review papers have been published covering both speech and speaker recognition [23, 25], little experimental comparison of the two fields was conducted until the most recent half-decade or so. In 2003, Nefian and Liang [122] and Lucey [106] both published papers comparing speech and speaker recognition for audio-visual speech. Results from both papers appear to show that the visual modality is much closer in performance to the acoustic modality for speaker recognition than for the recognition of speech, although neither paper particularly emphasises this point. A similar conclusion also appears to be supported by Bengio’s 2004 comparative paper [12].


While these three papers have looked at both speech and speaker recognition under

similar conditions, none drew any conclusion as to the comparative suitability of ei-

ther modality for speech or speaker recognition.

2.4 Audio-visual databases

One of the factors restricting AVSP research is the limited availability of suitable databases, especially when compared to similar databases for audio-only speech processing. While this is partly due to AVSP being a newer area of research, the main

reason for the sparseness of audio-visual databases is the difficulty in collecting, stor-

ing and distributing audio-visual data. For example, a typical audio-visual utterance

stored in a compressed video format might be 20-30 times as large as an equivalent

audio-only utterance. If the video data is not compressed, such as during data collec-

tion, then the difference is even more dramatic. Add in the difficulty of distributing

this volume of data to researchers and it can be seen that storage size has been (and

continues to be) a severe limiting factor on the development of audio-visual databases.

Due to these limitations, most early audio-visual databases were either designed for

a single task, or very limited in scope. However, as the costs of processing and storage have steadily decreased, size has become less of an issue, and more databases

have recently become available that are suitable for more general research. This sec-

tion will begin with a brief review of audio-visual speech processing databases, and

finish with an examination of the XM2VTS database [119], which will be used as the

basis for the experiments performed in this thesis.

2.4.1 A brief review of audio-visual databases

Most of the early audio-visual speech processing research focused on the speech recog-

nition task, and early databases were generally only designed to show the utility of


audio-visual speech on a single speaker [136, 168, 28]. When audio-visual speaker

recognition was studied, the speech was typically limited to a single short phrase over

a small number of speakers [182]. Most of these databases were collected directly by

the researchers involved and generally were not widely distributed due to their lim-

ited utility and large (for the time) size.

Starting in the mid-1990s, a number of larger multi-speaker databases were released, such as the Tulips 1 [121] and DAVID [26] databases. By allowing these databases to be used by researchers other than their creators, speech and speaker recognition performance could be compared by different researchers on the same databases. However, the size of these databases was still limited compared to the far more abundant audio speech

databases available at the time, typically with only 10 to 30 speakers and a very limited

vocabulary.

The M2VTS [137] database was released and then extended into the XM2VTS database [119] in the late nineties, and has proved very popular for audio-visual speech research.

While the vocabulary of the XM2VTS database was still relatively limited, the large

number of speakers available (295) has provided a much more robust research base

for both speech and speaker recognition research, and it is currently the largest pub-

licly available audio-visual database with around 30 hours of speech. As it will serve

as the basis of the research in this thesis, the XM2VTS database will be examined in

more detail in Section 2.4.2.

The XM2VTS database has served as a useful benchmark for audio-visual speech re-

search, but its vocabulary is limited to English digits and a single phonetically bal-

anced phrase. The VidTIMIT database [163] has been recently released to examine

audio-visual speech over a wider vocabulary by having 43 speakers say 10 phoneti-

cally balanced phrases selected from the TIMIT [79] acoustic speech database. Inspired

by the VidTIMIT database, the AVTIMIT [68] database was collected with 223 speakers each saying 20 TIMIT phrases. While these databases are certainly a good start towards large

vocabulary audio-visual speech processing, their relatively small size (40 minutes for

VidTIMIT and 4 hours for AVTIMIT) puts limitations on their utility for developing


reliable audio-visual speech models. To date, the most extensive database available for large vocabulary audio-visual speech recognition is the IBM ViaVoice database [126], with 50 hours of audio-visual speech collected over 290 speakers. Unfortunately, due to commercial constraints this database is not available publicly, leaving most research

to be performed on the smaller publicly available databases.

Until quite recently, most audio-visual speech databases have consisted of data col-

lected in clean studio conditions. While this has been useful for the study of audio-

visual speech, more realistic conditions are required to demonstrate the efficacy of

audio-visual speech in the real world. Some examples of recent databases designed

to study more real-world conditions are the CUAVE [133] database, which deals with problems in face and pose tracking, the AVICAR [96] database, looking at audio-visual speech recognition in automotive environments, and the IBM Smart Room database [148], focused on meeting room environments. The office-environment-based BANCA [6] database also looks promising, but has not yet been released in a usable form for audio-

visual speech research.

The recent reduction in distribution and storage costs has allowed some of the more

recent small audio-visual databases to be released to interested researchers at very

low, or even no cost, in the hope of wider use by the audio-visual speech research

community. Some examples of this are the CUAVE dataset [133], VidTIMIT [163] and

the Australian English dataset AVOZES [62]. While these databases are not as large

as XM2VTS, the releasers of these databases hope that the very low cost (or no-cost)

will encourage their wide distribution amongst researchers and subsequent use as a

benchmark for audio-visual speech research.

2.4.2 The XM2VTS database

The XM2VTS [119] database was released by the European M2VTS project (Multi

Modal Verification for Teleservices and Security applications) [138] with the aim of

extending their existing M2VTS database [137] into a large, high-quality audio-visual database.

Figure 2.3: Some examples of raw frame images from the XM2VTS database [119].

Since its release the XM2VTS (extended M2VTS) database has continued to be the largest publicly available audio-visual speech database, with around 30 hours of raw video available. The only audio-visual database which is larger is IBM’s ViaVoice database [126], which has not been made available to the research public.

The XM2VTS database consists of 295 participants speaking 3 distinct phrases. These

phrases are the same throughout all speakers and sessions, and are:

1. “0 1 2 3 4 5 6 7 8 9”

2. “5 0 6 9 2 8 1 3 7 4”

3. “Joe took father’s green shoe bench out”

The speech events were arranged into two ‘shots’ per session, where each of the three phrases is spoken in each shot. Four sessions were recorded in total over a period of five months to capture the natural variability of speakers over time. Each of the shots was recorded in studio conditions with good illumination and a blue background suitable for chroma-keying. Some examples of such frames from the database are

given in Figure 2.3.

The XM2VTS database was primarily designed for the speaker recognition task, and

a speaker-verification protocol [107] was released alongside the database to enable re-

searchers to benchmark performance easily. In the protocol the 295 speakers of the

database were split into 200 clients and 95 impostors. Two configurations were created, defining which sessions were used for training, evaluation and testing of the speaker verification system; these are shown in Figure 2.4. The second configuration will serve as the basis of the speech processing framework used for the experiments performed in this thesis, but adapted such that it can be used for both speech and speaker recognition.

Figure 2.4: Configurations for person recognition defined by the XM2VTS protocol [107].

2.5 Chapter summary

This chapter has provided a concise summary of the field of AVSP, covering both areas

of speech and speaker recognition. Both human-based and automatic speech process-

ing research was reviewed to introduce the fundamental concepts involved in AVSP

research.

A review of the existing literature in human production and perception of speech

and speakers was conducted, including the benefits of speech-related movement for

human recognition of faces. The speech perception studies clearly show that while

speech production itself is primarily an acoustic process, it does have visual side-

effects that humans have come to rely upon to improve their perception of each other’s speech. In particular, studies have shown that even with clearly articulated speech, human listeners could recognise speech with higher accuracy than with the audio alone.


Studies of human recognition of speakers based on audio-visual speech were very

limited in the literature, but a number of studies showed that human recognition of

faces was improved with speech-like movement. One recent significant study has

shown that recognition of familiar voices was faster when the correct face was shown,

suggesting that a combination of acoustic and visual speaker recognition occurs when

both are available for human recognition of speakers.

In the final sections of this chapter a brief history of automatic speech and speaker

recognition systems was presented, along with a review of databases available for

audio-visual speech processing. Major publications of importance in both fields were

indicated, as well as the cross-over between these fields and the closely related fields of acoustic speech and speaker recognition.


Chapter 3

Speech and Speaker Classification

3.1 Introduction

Classification is the process of assigning input features into one of a finite number

of classes. For the tasks of speech and speaker classification, these classes are either speech events or speakers’ identities, respectively. When tested against a particular

sequence of features, classification is generally given as a score representing the likeli-

hood of the features belonging to the class represented by a particular classifier. This

score can then be compared to other classifier scores to make a decision on the most

likely class of a particular set of data. Before a classifier can make such a decision, it needs to be trained on many sets of features that are typical of a particular class so that accurate classification can occur. This training process is conducted on a separate set of data to that upon which the classifiers will eventually be used.

This chapter will focus on classification methods which are suitable for implementing speech models of acoustic and visual speech features. Both Gaussian mixture models (GMMs) and hidden Markov models (HMMs) will be introduced as speech classifiers that have shown good performance in the existing literature at modelling human speech events. In addition to training these models directly, maximum a posteriori (MAP) adaptation will also be introduced as a technique to allow speaker-dependent speech models to be trained with limited data.

3.2 Background

The goal of classification is to divide a multi-dimensional feature-space into regions

based upon whether a particular point, or observation, in that space belongs to a par-

ticular class [59]. For speech recognition these classes would correspond to words or

sub-words, whereas a speaker recognition classifier would be choosing amongst sep-

arate classes for each speaker. Classifiers can be used either to choose amongst many different classes, or to make a binary accept/reject decision for a single class.

Ideally classes should be completely separate within the feature space, allowing clas-

sifiers to unambiguously determine which class any particular point in feature space

would correspond to. Unfortunately, this is not the case in the real world, so the aim

of classifier design is to reduce the classification error. As the classifiers are trained on

known data, this classification error can easily be calculated as the number of feature-

space points placed in a class that does not match the labelled class.

3.2.1 Bayes classifier

The Bayes classifier [59] is a theoretical classifier that provides the best performance

for any pattern recognition application by minimising the probability of classification

error. Bayes classification is based upon the assumption that the observations for a

particular class can be modelled as a random variable with a known probability dis-

tribution. Bayes theorem defines the posterior probability of observation o being in a

particular class ωi as:


\[ P(\omega_i \mid \mathbf{o}) = \frac{p(\mathbf{o} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{o})} \tag{3.1} \]

where p (o|ωi) is the class conditional probability density function for observation o in

class ωi, P (ωi) is the a priori probability of class ωi and p (o) is the probability density

function for observation o.

For the purposes of choosing between a number of classes only the numerator of (3.1)

is important as the denominator is identical for all classes ωi. Therefore given two

classes ω1 and ω2, a classification decision can be made as:

\[ \text{Assign } \omega \rightarrow \begin{cases} \omega_1 & p(\mathbf{o} \mid \omega_1)\, P(\omega_1) > p(\mathbf{o} \mid \omega_2)\, P(\omega_2) \\ \omega_2 & \text{otherwise} \end{cases} \tag{3.2} \]

Therefore choosing the most likely class for a particular observation is simply a matter

of sorting the numerators of (3.1) and choosing the highest score [48].
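As an illustration of this decision rule, the following minimal Python sketch (not taken from this thesis; the function and variable names are assumptions made for illustration) scores an observation against each class using the log of the numerator of (3.1) and selects the highest-scoring class, as in (3.2). Working in the log domain is purely a numerical convenience and does not change the ordering of the classes.

import numpy as np

def gaussian_logpdf(o, mean, var):
    # Log of a univariate Gaussian density.
    return -0.5 * (np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def bayes_decision(o, class_log_densities, log_priors):
    # Score each class with log p(o|w) + log P(w), the log of the numerator of
    # (3.1), and return the highest-scoring class, as in (3.2).
    scores = {w: logpdf(o) + log_priors[w]
              for w, logpdf in class_log_densities.items()}
    return max(scores, key=scores.get)

# Toy example: two univariate Gaussian classes with equal priors (hypothetical values).
classes = {
    'omega_1': lambda o: gaussian_logpdf(o, mean=0.0, var=1.0),
    'omega_2': lambda o: gaussian_logpdf(o, mean=3.0, var=1.0),
}
priors = {'omega_1': np.log(0.5), 'omega_2': np.log(0.5)}
print(bayes_decision(1.0, classes, priors))  # prints 'omega_1'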

While the Bayesian classifier can theoretically provide the best performance of any

classifier, it does require that the P (ωi) and p (o|ωi) are known for every class ωi.

While P (ωi) can be calculated easily given enough training data [59], the probability

density function p (o|ωi) must be estimated based on a training set of observations

for each class. Clearly the more training observations that can be obtained, the better

the true p (o|ωi) can be modelled, and the more closely the Bayesian classifier performance will approach the theoretical maximum [105].

3.2.2 Non-parametric classifiers

One of the simplest methods of estimating the class density function p (o|ωi) is using a

non-parametric classifier. Non-parametric classifiers are so called because they make

no major assumptions of the underlying form of the class distributions, and therefore


are not represented using parameters of any particular modelling technique. Non-

parametric classifiers generally compare a test observation directly with the known-

class training observations to determine the class under test.

The simplest, and classical, implementation of a non-parametric classifier is the nearest

neighbour classifier. This classifier works by choosing the class of the closest (or a

majority of the k-closest) training observations to the test observation. This class is then returned as the most likely class for the test observation.
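A minimal sketch of such a k-nearest-neighbour classifier is given below (illustrative only, and not code from this thesis); it uses the Euclidean distance and a majority vote over the k closest training observations.

import numpy as np
from collections import Counter

def knn_classify(test_obs, train_obs, train_labels, k=3):
    # Euclidean distance from the test observation to every training observation.
    dists = np.linalg.norm(train_obs - test_obs, axis=1)
    # Indices of the k closest training observations.
    nearest = np.argsort(dists)[:k]
    # Majority vote amongst the labels of the k nearest neighbours.
    return Counter(train_labels[nearest]).most_common(1)[0][0]

# Toy example with two 2-D classes (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [1.9, 2.0]])
y = np.array(['A', 'A', 'B', 'B'])
print(knn_classify(np.array([0.2, 0.1]), X, y, k=3))  # prints 'A'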

Non-parametric methods can be very useful when the training data available for each class is limited, such as in face recognition. For applications where a reasonable amount of training data is available, however, the limitation of having to store and compare every training observation comes into play [105].

3.2.3 Parametric classifiers

Parametric classifiers are designed such that they make some assumption about the

form of the classes within the feature space, and the training process consists of esti-

mating the parameters of an assumed modelling technique [48].

As some assumption is made about the approximate nature of observations within

classes, a large number of training observations can be reduced to a relatively small

number of parameters defining the form of the assumed modelling technique. Fur-

thermore, since p (o|ωi) is calculated directly for each class, statistical methods can be

used to form the best models for each class [105].

For speech processing work, it is generally assumed that observations for a particular

class can be considered to be Gaussian about a relatively small number of points in

feature space [194]. This assumption has led to the development of GMMs and HMMs,

which are the basis of static and dynamic speech processing respectively, and will be

covered in detail for the remainder of this chapter.


3.3 Gaussian mixture models

Gaussian mixture models (GMMs) are a modelling technique that has been exten-

sively used for general pattern recognition research [59, 48]. As the name implies,

GMMs model classes with a weighted sum of Gaussian probability density functions

in feature-space.

The use of Gaussian models is encouraged by the Central Limit Theorem [58] which

states that a large number of measurements subject to small random errors will lead

to a Gaussian, or normal, distribution. Because such measurements are very common

in nature and other complex systems, Gaussian distributions, and therefore GMMs,

are well suited to representing complex variables. In particular GMMs have shown

good performance for text-independent acoustic speaker recognition [158].

GMMs are defined by a weighted sum of M Gaussian density functions, given by:

\[ p(\mathbf{o} \mid \omega_i) = \sum_{i=1}^{M} c_i\, b_i(\mathbf{o}) \tag{3.3} \]

where o is a D-dimensional observation vector, bi (o) is the Gaussian density function

for mixture i, and ci is the weight of mixture i. The weights ci must sum to unity over

all mixtures, $\sum_{i=1}^{M} c_i = 1$. Each $b_i(\mathbf{o})$ is a D-variate Gaussian function of the form

\[ b_i(\mathbf{o}) = \mathcal{N}(\mathbf{o}, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{D/2} \left| \boldsymbol{\Sigma}_i \right|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{o} - \boldsymbol{\mu}_i)' \boldsymbol{\Sigma}_i^{-1} (\mathbf{o} - \boldsymbol{\mu}_i) \right\} \tag{3.4} \]

with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ for Gaussian $i$ determined during training of the GMM.
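The following numpy sketch (an illustration assumed for this edit, not the thesis implementation) evaluates (3.3) and (3.4) directly for a single observation, given the weights, means and full covariance matrices of an M-mixture GMM.

import numpy as np

def gaussian_density(o, mu, sigma):
    # D-variate Gaussian density b_i(o) of equation (3.4).
    D = len(o)
    diff = o - mu
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm_const

def gmm_density(o, weights, means, covariances):
    # Weighted sum of the mixture densities, equation (3.3).
    return sum(c * gaussian_density(o, mu, sigma)
               for c, mu, sigma in zip(weights, means, covariances))

# Toy two-mixture GMM in two dimensions (hypothetical parameters).
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.5, -0.2]), weights, means, covs))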


3.3.1 GMM complexity

Because the form of the distributions is assumed (i.e. Gaussian), a GMM can be compactly defined by a single parameter vector, $\boldsymbol{\lambda}$, consisting of the weight, mean and covariances for each of the M mixture components:

\[ \boldsymbol{\lambda} = [\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_M] = [c_1, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, c_M, \boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M] \tag{3.5} \]

This representation is clearly much more compact than would be required in a non-parametric classifier, which generally stores every training observation. This simplified form allows statistical methods to be used to determine the optimal $\boldsymbol{\lambda}$ for a given set of data representing the class in training. However, GMMs can still be quite complex in terms of the number of free parameters in $\boldsymbol{\lambda}$. While this complexity may provide better modelling of the idiosyncrasies of the class under training, it must be traded off against the volume of training observations required to support it. A number of decisions can be made to reduce the complexity of a GMM without greatly degrading speech modelling performance, generally related to the form of the covariances and the topology of the GMM.

The first decision is choosing the form of the covariance matrix Σi. In the general case

the covariances between all D dimensions of a collection of observation vectors can be

represented by a full D × D covariance matrix. These covariance matrices can be of

the following form:

1. Nodal, where each Gaussian (node) has its own covariance matrix

2. Grand, where all Gaussians within a GMM share a single covariance matrix

3. Global, where all Gaussians within all GMMs share a single covariance matrix

Nodal covariance models are typically chosen as they allow each Gaussian to individually choose the best covariance, but the other options can be useful when training data is limited.

Additionally, rather than training a full covariance matrix, the data and training requirements can be reduced by only training a diagonal covariance vector, setting all inter-dimensional covariances to zero. The use of nodal, diagonal covariance vectors has been shown empirically to provide the best performance for most speech applications [158].
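As a sketch of what this simplification buys in practice (again illustrative, not from the thesis), a diagonal-covariance Gaussian log-density reduces to a sum over the D dimensions, with no matrix inversion or determinant computation required; the variance vector below is assumed to hold only the diagonal elements of the covariance matrix.

import numpy as np

def diag_gaussian_logpdf(o, mu, var):
    # Log-density of a Gaussian with a diagonal covariance: because the
    # inter-dimensional covariances are taken to be zero, the density
    # factorises over the D dimensions and only a variance vector needs
    # to be stored and trained.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

# Hypothetical 3-dimensional example.
o = np.array([1.0, -0.5, 0.3])
mu = np.zeros(3)
var = np.array([1.0, 0.8, 1.2])  # diagonal elements of the covariance matrix
print(diag_gaussian_logpdf(o, mu, var))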

The second decision involves choosing the topology of the GMM. In model design

topology generally refers to the top-level layout of the classifier, which in GMMs boils

down to choosing the number of Gaussians, defined as M above. The choice of M

comes down to a simple trade-off between the complexity of the classifier and the

amount of training data available. If the GMM is too complex (M too large), it may

over-fit to the training data, impairing the model’s performance on unseen data; however, if the GMM is too simple (M too small) it may not model the variety of the train-

ing observations adequately. Unfortunately, there does not exist a known theoretical

method of calculating the optimal value of M prior to performing the Gaussian train-

ing. M is therefore chosen through heuristic and empirical evidence based on the final

model performance on unseen data [105].

3.3.2 GMM parameter estimation

Once the covariance form and topology of the GMM have been chosen, the GMM can be trained by determining the values of $\boldsymbol{\lambda}$ that best fit the training data, through a process called maximum likelihood estimation. In maximum likelihood estimation, if we have a set of observations $O$ of size $N$ drawn from the class being modelled, $O = \{\mathbf{o}_1, \ldots, \mathbf{o}_N\}$, then the likelihood of a given set of parameters $\boldsymbol{\lambda}$ producing that data set is given by:

\[ L(\boldsymbol{\lambda} \mid O) = p(O \mid \boldsymbol{\lambda}) = \prod_{i=1}^{N} p(\mathbf{o}_i \mid \boldsymbol{\lambda}) \tag{3.6} \]


The optimal parameters $\boldsymbol{\lambda}'$ can then simply be expressed as:

\[ \boldsymbol{\lambda}' = \operatorname*{arg\,max}_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda} \mid O) \tag{3.7} \]

However, this does not specify how the parameter vector $\boldsymbol{\lambda}$ is varied to determine the maximum likelihood, which is a non-trivial problem for any number of Gaussian mixtures, $M > 1$. While a single Gaussian's parameters could be determined directly by examining the data, the parameters of multiple mixtures must be determined through a more elaborate process, of which one popular method is known as expectation maximisation (EM) [13].

EM is an iterative process used to improve a parameter vector based upon the likeli-

hood of observations being fitted by said vector. For GMM training, EM is performed

by maximising the parameters of each mixture individually based upon the training

observations that suit each individual mixture. The EM algorithm consists of four

stages, which are given for the training of GMM mixture i here:

1. Initialisation: set $\boldsymbol{\lambda}_i^{\{0\}}$ to an initial value, and set $t = 0$.

2. Expectation: calculate $L\big(\boldsymbol{\lambda}_i^{\{t\}} \mid O\big)$.

3. Maximisation: $\boldsymbol{\lambda}_i^{\{t+1\}} = \operatorname*{arg\,max}_{\boldsymbol{\lambda}_i} L\big(\boldsymbol{\lambda}_i^{\{t\}} \mid O\big)$.

4. Iterate: set $t = t + 1$ and repeat from step 2 until $L\big(\boldsymbol{\lambda}_i^{\{t\}} \mid O\big) - L\big(\boldsymbol{\lambda}_i^{\{t-1\}} \mid O\big) \leq Th$ or $t \geq T$,

where $\boldsymbol{\lambda}_i^{\{t\}}$ is the estimate of $\boldsymbol{\lambda}_i$ at step $t$, $Th$ is a predefined convergence threshold, and $T$ is the maximum number of iterations permitted.

Before the EM algorithm can be applied, a good ‘first guess’ of each mixture’s parameter vector, $\boldsymbol{\lambda}_i^{\{0\}} = \{c_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}$, must first be provided to serve as a starting point for the expectation and maximisation steps. The initial parameters can be chosen based on a random selection from the training observations, but best performance is normally obtained using a non-random initialisation process. The most common initialisation method uses k-means clustering [59, 2] to choose $M$ clusters from the training observations and initialise $\boldsymbol{\lambda}_i^{\{0\}}$ based on one mixture for each cluster. All GMM-based experiments conducted in this thesis are initialised in this manner.
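A minimal sketch of this style of initialisation is shown below (an assumption made for illustration only; it is not the exact clustering procedure used for the experiments in this thesis). The training observations are grouped into M clusters, and each cluster provides the initial weight, mean and diagonal covariance of one mixture.

import numpy as np

def kmeans_init(obs, M, n_iter=20, seed=0):
    # Cluster the N x D training observations into M groups and derive one
    # initial mixture (weight, mean, diagonal covariance) from each cluster.
    rng = np.random.default_rng(seed)
    means = obs[rng.choice(len(obs), size=M, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its closest current mean.
        assign = np.argmin(
            np.linalg.norm(obs[:, None, :] - means[None, :, :], axis=2), axis=1)
        # Re-estimate each mean from its assigned observations.
        means = np.array([obs[assign == i].mean(axis=0) if np.any(assign == i)
                          else means[i] for i in range(M)])
    weights = np.array([np.mean(assign == i) for i in range(M)])
    variances = np.array([obs[assign == i].var(axis=0) + 1e-6 if np.any(assign == i)
                          else np.ones(obs.shape[1]) for i in range(M)])
    return weights, means, variances

# Example: seed a 4-mixture model from 200 random 3-D observations (toy data).
w0, mu0, var0 = kmeans_init(np.random.default_rng(1).normal(size=(200, 3)), M=4)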

The expectation step of the EM algorithm determines the likelihood of the current parameter vector $\boldsymbol{\lambda}_i^{\{t\}}$ fitting each observation in the training set, $\mathbf{o}_n \in O$. This is calculated based on a mixture-normalised likelihood:

\[ l_i(n) = \frac{c_i\, b_i(\mathbf{o}_n)}{\sum_{k=1}^{M} c_k\, b_k(\mathbf{o}_n)} \tag{3.8} \]

where $L(\boldsymbol{\lambda}_i \mid O) \approx \prod_{n=1}^{N} l_i(n)$.

Once the likelihoods have been calculated, the parameters can be recalculated using $l_i(n)$ to determine the likelihood that observation $\mathbf{o}_n$ is covered by mixture $i$, under the

previous choice of parameters. As a single mixture only comprises a mean, covariance

and weight parameter, these parameters can be calculated using standard statistical

methods [105]:

\[ \boldsymbol{\mu}_i = \frac{\sum_{n=1}^{N} l_i(n)\, \mathbf{o}_n}{\sum_{n=1}^{N} l_i(n)} \tag{3.9} \]

\[ \boldsymbol{\Sigma}_i = \frac{\sum_{n=1}^{N} l_i(n)\, (\mathbf{o}_n - \boldsymbol{\mu}_i)(\mathbf{o}_n - \boldsymbol{\mu}_i)'}{\sum_{n=1}^{N} l_i(n)} \tag{3.10} \]

\[ c_i = \frac{1}{N} \sum_{n=1}^{N} l_i(n) \tag{3.11} \]

Finally, the iterative step compares the likelihoods obtained with the new parameters to the old likelihoods to decide whether the EM algorithm has converged to a maximum, at which point the algorithm will conclude.

Figure 3.1: A Markov process can be modelled as a state machine with probabilistic transitions ($a_{ij}$) between states at discrete intervals of time ($t = 1, 2, \ldots$).
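To tie equations (3.8)-(3.11) and the iteration loop together, the following diagonal-covariance sketch (assumed for illustration; it is not the software used for the experiments in this thesis) performs the E-step of computing the mixture-normalised likelihoods $l_i(n)$, the M-step of re-estimating the weights, means and variances, and a convergence test based on the total data log-likelihood, a common practical stand-in for the per-mixture likelihood described above.

import numpy as np

def diag_gaussian_pdf(obs, mu, var):
    # Densities of all N observations under one diagonal-covariance Gaussian.
    return np.exp(-0.5 * np.sum(np.log(2 * np.pi * var)
                                + (obs - mu) ** 2 / var, axis=1))

def em_step(obs, weights, means, variances):
    # E-step: mixture-normalised likelihoods l_i(n), equation (3.8).
    dens = np.stack([w * diag_gaussian_pdf(obs, mu, var)
                     for w, mu, var in zip(weights, means, variances)])  # (M, N)
    resp = dens / dens.sum(axis=0, keepdims=True)
    # M-step: re-estimate means, variances and weights, equations (3.9)-(3.11).
    totals = resp.sum(axis=1)                       # sum over n of l_i(n)
    means = (resp @ obs) / totals[:, None]
    variances = np.stack([
        (resp[i][:, None] * (obs - means[i]) ** 2).sum(axis=0) / totals[i] + 1e-6
        for i in range(len(weights))])
    weights = totals / len(obs)
    # Total log-likelihood of the data, used here for the convergence test.
    log_lik = np.sum(np.log(dens.sum(axis=0)))
    return weights, means, variances, log_lik

def train_gmm(obs, weights, means, variances, thresh=1e-4, max_iter=100):
    # Iterate EM until the likelihood improvement falls below the threshold
    # or the maximum number of iterations is reached.
    prev = -np.inf
    for _ in range(max_iter):
        weights, means, variances, log_lik = em_step(obs, weights, means, variances)
        if log_lik - prev <= thresh:
            break
        prev = log_lik
    return weights, means, variances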

3.4 Hidden Markov models

Hidden Markov models (HMMs) are a well-established mathematical tool for building a statistical model of temporal observations. Whereas the GMMs introduced in the previous section model individual observations independently of each other, HMMs are designed to treat observations as a sequence in time or space. While spatial HMMs can be useful for applications such as handwriting recognition [3], most applications of HMMs involve temporal observations, of which speech is a very common example [194]. For this reason, HMMs will be introduced in this section, and used throughout this thesis, as a temporal model.

3.4.1 Markov models

HMMs are designed to model sequences of observations based on the underlying assumption that these observations were produced by a hidden state machine whose parameters are not known. In this underlying state machine, referred to as a Markov chain or model [153], the states can change based on statistical probabilities at discrete points in time, as shown in Figure 3.1.

Page 73: Synchronous HMMs for Audio-Visual Speech Processingeprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf · Synchronous HMMs for Audio-Visual Speech Processing by David Dean, BEng (Hons),

3.4 Hidden Markov models 43

Markov models can be used to determine the likelihood of a particular sequence of events, given a particular path through the network. Given a Markov model with $S$ states, the parameters of the model can be defined as

$$ \boldsymbol{\lambda} = \left[ a_{ij} : 1 \le i \le S,\ 1 \le j \le S \right] \tag{3.12} $$

and if the known path, or sequence, through the network is given as

$$ q = [q_1, q_2, \ldots, q_T], \quad 1 \le q_t \le S \tag{3.13} $$

where $q_t$ is the state occupied at time $t$, then the model parameters can easily be examined to determine the probability of path $q$ being traversed.

If the probability of starting in state $i$ is defined as $\pi_i = P(q_1 = i)$, then the probability of Markov model $\boldsymbol{\lambda}$ producing sequence $q$ can be given as the product of the initial and transition probabilities:

$$ P(q \mid \boldsymbol{\lambda}) = P(q_1 \mid \boldsymbol{\lambda}) \prod_{t=2}^{T} P(q_t \mid q_{t-1}, \boldsymbol{\lambda}) \tag{3.14} $$
$$ = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{3.15} $$
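As a small illustration of (3.14) and (3.15), the following sketch evaluates the probability of a given state path through a Markov chain; the two-state chain and its probabilities are purely illustrative.

```python
import numpy as np

# Illustrative two-state Markov chain: initial probabilities pi_i and
# transition probabilities a_ij, as in (3.12)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])

def path_probability(q, pi, A):
    """P(q | lambda) = pi_{q1} a_{q1 q2} ... a_{q_{T-1} q_T}, cf. (3.15)."""
    prob = pi[q[0]]
    for prev, curr in zip(q[:-1], q[1:]):
        prob *= A[prev, curr]
    return prob

print(path_probability([0, 0, 1, 1], pi, A))  # 0.6 * 0.7 * 0.3 * 0.8 = 0.1008
```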

3.4.2 Hidden Markov models

While Markov models can be very useful in modelling observations where the underlying states can easily be determined from the observations, in practice the actual state sequence is unknown. This problem has led to the development of the HMM design, where the underlying state sequence is said to be hidden.

Rather than being presented with a known state sequence $q$, a HMM works with a sequence of observation vectors given by


$$ O = [o_1, o_2, \ldots, o_T] \tag{3.16} $$

where $o_t$ is the observation vector at time $t$.

As the underlying state sequence $q$ is unknown, the probability of the HMM producing this observation sequence must be evaluated over all possible $q$ [105]:

$$ P(O \mid \boldsymbol{\lambda}) = \sum_{\text{all } q} P(O \mid q, \boldsymbol{\lambda})\, P(q \mid \boldsymbol{\lambda}) \tag{3.17} $$

It can be seen that in addition to the sequence probability $P(q \mid \boldsymbol{\lambda})$ given in (3.15), the output emission probability density $P(O \mid q, \boldsymbol{\lambda})$ must also be calculated over all sequences. Given the assumption that the observations are conditionally independent given the state sequence, this probability can be represented as a conditional density function with no loss of accuracy [105]. This density function can then be expressed as the product of state-specific emission densities over time as follows:

$$ p(O \mid q, \boldsymbol{\lambda}) = \prod_{t=1}^{T} p(o_t \mid q_t, \boldsymbol{\lambda}) \tag{3.18} $$
$$ = \prod_{t=1}^{T} b_{q_t}(o_t) \tag{3.19} $$

where $b_i(o)$ is the output-emission probability density function of state $i$.

Therefore to fully represent a HMM, the parameter vector $\boldsymbol{\lambda}$ must also contain these output density functions in addition to the state transition likelihoods:

$$ \boldsymbol{\lambda} = [A, B] \tag{3.20} $$
$$ = \left[ a_{ij},\ b_i(o) : 1 \le i \le S,\ 1 \le j \le S \right] \tag{3.21} $$


It must be noted that this particular implementation of the state-based probabilities is for a continuous HMM, as the model works on the actual continuous, real-valued observations rather than quantising the observations into discrete symbols before modelling, as in a discrete HMM. Continuous HMMs have been shown to provide much better performance for speech recognition tasks than discrete HMMs [153].

At a basic level HMMs can be viewed as a temporal structure, and the choice of modelling technique for $b_j(o)$ is not dictated by the HMM framework in any way. However, in practice, continuous HMMs are typically implemented with the output density functions represented by a GMM for each state:

$$ b_j(o) = \sum_{m=1}^{M} c_{jm}\, b_{jm}(o) \tag{3.22} $$
$$ = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left(o, \boldsymbol{\mu}_{jm}, \Sigma_{jm}\right) \tag{3.23} $$

where $c_{jm}$, $\boldsymbol{\mu}_{jm}$ and $\Sigma_{jm}$ are the weight, mean vector and covariance matrix, respectively, of mixture $m$ in the GMM for state $j$. More details on the implementation of GMMs can be found in Section 3.3, where they are covered as a classifier in their own right.
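As a brief illustration of (3.22) and (3.23), a state's output density can be evaluated in the log domain by summing the weighted mixture components; this generic sketch uses scipy's multivariate normal and is not the implementation used for the experiments in this thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_output_density(o, weights, means, covs):
    """log b_j(o) for one state whose GMM has weights c_jm, means mu_jm
    and covariances Sigma_jm, cf. (3.23)."""
    log_terms = [np.log(c) + multivariate_normal.logpdf(o, mean=m, cov=S)
                 for c, m, S in zip(weights, means, covs)]
    return np.logaddexp.reduce(log_terms)
```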

Therefore with the complete specification of the HMM parameters $\boldsymbol{\lambda}$, the likelihood of a particular sequence of observations $O$ can be calculated by (3.17). However, the need to calculate the probabilities over all possible paths through the state machine is typically prohibitive, with on the order of $S^T$ possible paths available. This calculation can be simplified immensely if, instead of calculating the probability over all possible paths, only the probability of the most-likely path is considered. If this most-likely path is referred to as $q'$ then

$$ P(O \mid \boldsymbol{\lambda}) \approx P(O \mid q', \boldsymbol{\lambda}) \tag{3.24} $$

This simplification is referred to as the Viterbi approximation [194, 105], and can greatly


simplify the calculation of P (O|λλλ) with no significant loss in performance [105]. Of

course, the Viterbi approximation does require that the optimal path q′ can be calcu-

lated in some manner first, which leads to the Viterbi decoding algorithm, designed for

this very purpose.

3.4.3 Viterbi decoding algorithm

The Viterbi decoding algorithm is designed to find the most likely path $q'$ for a particular sequence of observations $O$ through a HMM defined by $\boldsymbol{\lambda}$, without having to exhaustively search every possible path in the process. To simplify the task of choosing the most likely path, the Viterbi decoding algorithm only calculates (and remembers) the single most likely path to each state $j$ at time $t$. Only $S$ possible paths are kept at each time step, rather than the $S^t$ that would be required for an exhaustive search, with a corresponding increase in performance. While this approach is not guaranteed to always find the best path due to this assumption, in practice the algorithm works effectively for speech and other applications of HMMs [153, 194, 105].

The Viterbi algorithm can easily be represented as a recursive algorithm that calculates both the probability of arriving in state $j$ at time $t$ through the most likely path, defined as $\delta_t(j)$, and the previous state in that same path, $\psi_t(j)$.

Initially these parameters are defined for each state at time $t = 1$:

$$ \delta_1(j) = \pi_j\, b_j(o_1), \quad 1 \le j \le S $$
$$ \psi_1(j) = 0, \quad 1 \le j \le S \tag{3.25} $$

Then at each step t, the most likely previous state is chosen for each current state, and

the current probability is calculated:


$$ \delta_t(j) = \max_{1 \le i \le S} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le S $$
$$ \psi_t(j) = \arg\max_{1 \le i \le S} \left[ \delta_{t-1}(i)\, a_{ij} \right], \quad 2 \le t \le T,\ 1 \le j \le S \tag{3.26} $$

So at each $t$, the best path (i.e. the most likely previous state) for state $j$ is stored in $\psi_t(j)$ and the probability of arriving at state $j$ along that path is stored in $\delta_t(j)$. Therefore at the final observation $t = T$, the final probability and the final state of the most likely path are given by:

$$ P' = \max_{1 \le j \le S} \left[ \delta_T(j) \right] $$
$$ q'_T = \arg\max_{1 \le j \le S} \left[ \delta_T(j) \right] \tag{3.27} $$

Once the final path state $q'_T$ has been determined, the full path $q' = \{q'_1, q'_2, \ldots, q'_T\}$ can be determined by backtracking through $\psi_t$:

$$ q'_t = \psi_{t+1}\!\left(q'_{t+1}\right), \quad t = T-1, T-2, \ldots, 1 \tag{3.28} $$

In practice a closely related version of this algorithm is implemented using logarithms of the parameters to simplify implementation in computer code, as the need for multiplications of very small probabilities can be eliminated.
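A minimal log-domain sketch of the recursion in (3.25)-(3.28) is given below. It assumes the state output log-likelihoods $\log b_j(o_t)$ have already been evaluated into a matrix, and is an illustrative decoder rather than a production one.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path through an HMM, in the log domain.

    log_pi: (S,)   log initial probabilities
    log_A:  (S, S) log transition probabilities log a_ij
    log_B:  (T, S) log output likelihoods log b_j(o_t)
    Returns (best_path, best_log_prob), cf. (3.25)-(3.28).
    """
    T, S = log_B.shape
    delta = np.empty((T, S))            # log delta_t(j)
    psi = np.zeros((T, S), dtype=int)   # backpointers psi_t(j)

    delta[0] = log_pi + log_B[0]                 # (3.25)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # delta_{t-1}(i) + log a_ij
        psi[t] = np.argmax(scores, axis=0)       # (3.26)
        delta[t] = scores[psi[t], np.arange(S)] + log_B[t]

    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])              # (3.27)
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]        # (3.28)
    return path, delta[-1, path[-1]]
```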

3.4.4 HMM parameter estimation

The goal of HMM parameter estimation is to determine a HMM parameter vector $\boldsymbol{\lambda}$, defined in (3.20), based on a set of training observation sequences $\{O^1, O^2, \ldots, O^N\}$. HMM parameter estimation can be considered to encompass that of GMM parameter estimation covered in Section 3.3.2, but complicated further as each state of the HMM


has a GMM that requires parameter estimation in its own right. The individual GMM parameter estimation is made more difficult still, as the alignment of observations to state GMMs is not completely defined and can change somewhat during the training process.

A single HMM contains S states, represented by a single GMM each. As it has already

been established that a single GMM’s parameter vector is too complicated to optimise

analytically, having to estimate S of these does not make the process any easier. How-

ever, the underlying general EM algorithm introduced in Section 3.3.2 can be applied to

the task of HMM parameter estimation in a similar manner. This specific instance of

the EM algorithm is called the Baum-Welch algorithm, and has been shown to provide

good performance for HMM training for speech and other applications [153, 194].

Like the more general EM algorithm, the Baum-Welch algorithm iteratively determines the $\boldsymbol{\lambda}$ that locally maximises $L(\boldsymbol{\lambda} \mid O)$ at each stage, thereby arriving at a suitable $\boldsymbol{\lambda}$ for the models being trained. This algorithm will be introduced shortly, but first a suitable initialisation, or ‘best guess’, of $\boldsymbol{\lambda}$ must be determined to provide a suitable starting point. This initialisation is performed using the technique of Viterbi training.

Viterbi training

For HMM parameter estimation, Viterbi training serves as the initialisation step of the

EM algorithm, providing a good ‘first guess’ of the HMM parameter vector λλλ. This

can be considered analogous to the application of k-means clustering in initialising a

GMM parameter vector.

The main task of Viterbi training is to align the observations in the training sequences against the correct state model, whereupon the state models can then be estimated from the aligned observations. This alignment is performed by dividing each observation sequence into S equal segments at the initial stage, after which the best align-


ment, q′, is performed using the Viterbi decoding algorithm described in Section 3.4.3.

At this stage, the state-transition probabilities of the HMM can be estimated from the

Viterbi alignment. If Aij is the total number of transitions from state i to state j in q′

over all N training observation sequences, then the state-transition probabilities can

be estimated by

$$ a_{ij} = \frac{A_{ij}}{\sum_{k=1}^{S} A_{ik}} \tag{3.29} $$

Once each state has been aligned with the corresponding training observations, each observation is assigned to a particular mixture in the state-model GMM, and each of these mixtures’ parameters $\boldsymbol{\mu}_i$ and $\Sigma_i$, as well as the mixture weights $c_i$, can be calculated using standard statistical methods similar to those of GMM training in Section 3.3.2. This assignment of observation to mixture is performed using k-means clustering on the first step, and thereafter by choosing the most likely mixture for each observation.

Once the state-model GMM parameter vector has been calculated, a new alignment is

performed and the process begins anew. The Viterbi training process ends when there

is minimal change in the HMM parameter vector λλλ.
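As a small illustration of (3.29), the transition probabilities can be re-estimated from the Viterbi alignments by counting transitions. This sketch assumes the aligned state paths are already available and, for simplicity, ignores the non-emitting entry and exit states used in practice.

```python
import numpy as np

def estimate_transitions(state_paths, S):
    """Estimate a_ij from aligned state paths, as in (3.29)."""
    counts = np.zeros((S, S))                 # A_ij over all training sequences
    for q in state_paths:
        for prev, curr in zip(q[:-1], q[1:]):
            counts[prev, curr] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```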

Baum-Welch algorithm

Once a good initial estimate of the HMM parameter vector $\boldsymbol{\lambda}$ has been provided by the Viterbi training algorithm, the Baum-Welch algorithm is used to iteratively improve this estimate. Being within the same class of EM algorithms, the Baum-Welch algorithm is quite similar to Viterbi training. However, rather than assigning observations definitively to states and mixtures, Baum-Welch re-estimation considers states and mixtures to have soft boundaries, and each observation is spread amongst all states and mixtures based upon its likelihood of being in each. This likelihood is calculated


from both forward and backward probabilities of state-and-mixture occupation, leading to the Baum-Welch algorithm’s other moniker, the forward-backward algorithm.

The forward probability, $\alpha_j(t)$, is defined as the likelihood of a particular observation sequence arriving at state $j$ at time $t$, or more formally,

$$ \alpha_j(t) = p(o_1 o_2 \ldots o_t,\ q_t = j \mid \boldsymbol{\lambda}) \tag{3.30} $$

This value can be calculated recursively by first defining $\alpha_j(1)$ based on the initial observation,

$$ \alpha_j(1) = \pi_j\, b_j(o_1), \quad 1 \le j \le S \tag{3.31} $$

and can then be extended for higher values of $t$ based on the previous $\alpha_i(t)$ values and the current observation:

$$ \alpha_j(t) = \left[ \sum_{i=1}^{S} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t), \quad 1 \le j \le S,\ 2 \le t \le T \tag{3.32} $$

In a similar manner to the forward probability, the backward probability is defined as the likelihood of the remaining observations from time $t+1$ onwards, given that state $j$ is occupied at time $t$. This can be expressed formally as

$$ \beta_j(t) = p(o_{t+1} o_{t+2} \ldots o_T \mid q_t = j, \boldsymbol{\lambda}) \tag{3.33} $$

and can also be calculated in a similar manner to $\alpha_j(t)$. At time $t = T$, the backward likelihood of every state is defined as unity:

$$ \beta_j(T) = 1, \quad 1 \le j \le S \tag{3.34} $$


The backward probabilities at earlier times can then be calculated recursively,

$$ \beta_j(t) = \sum_{k=1}^{S} a_{jk}\, b_k(o_{t+1})\, \beta_k(t+1), \quad 1 \le j, k \le S,\ 1 \le t \le T-1 \tag{3.35} $$

Because the forward probability has been defined as a joint probability, but the backward as a conditional, the two can be combined to determine the joint probability of being in state $j$ at time $t$ within a complete sequence of observations, $O = \{o_1, o_2, \ldots, o_T\}$:

$$ p(O,\ q_t = j \mid \boldsymbol{\lambda}) = \alpha_j(t)\, \beta_j(t) \tag{3.36} $$

Using (3.36) we can define the likelihood of state $j$ being occupied at time $t$ for the $n$th training observation sequence $O^n$ in terms of the forward and backward probabilities:

$$ L^n_j(t) = p(q^n_t = j \mid O^n, \boldsymbol{\lambda}) \tag{3.37} $$
$$ = \frac{p(O^n,\ q^n_t = j \mid \boldsymbol{\lambda})}{p(O^n \mid \boldsymbol{\lambda})} \tag{3.38} $$
$$ = \frac{1}{P^n}\, \alpha^n_j(t)\, \beta^n_j(t) \tag{3.39} $$

where $P^n = p(O^n \mid \boldsymbol{\lambda})$, which can be calculated based on the full iteration of either probability:

$$ P^n = \alpha^n_S(T) = \beta^n_1(1) \tag{3.40} $$

The state likelihood in (3.39) can be extended to mixture $m$ within state $j$ as

$$ L^n_{jm}(t) = \frac{1}{P^n} \left[ \sum_{i=1}^{S} \alpha^n_i(t-1)\, a_{ij} \right] c_{jm}\, b_{jm}(o^n_t)\, \beta^n_j(t) \tag{3.41} $$


which serves as the expectation step for further EM under the Baum-Welch algorithm.
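The forward and backward recursions of (3.31)-(3.36) and the resulting state occupation likelihoods of (3.39) can be sketched as follows for a single observation sequence. For brevity this works with raw probabilities; a practical implementation would use scaling or log arithmetic to avoid underflow.

```python
import numpy as np

def forward_backward(pi, A, B):
    """State occupation likelihoods for one observation sequence.

    pi: (S,)   initial probabilities
    A:  (S, S) transition probabilities a_ij
    B:  (T, S) output likelihoods b_j(o_t)
    Returns (alpha, beta, L, P) following (3.31)-(3.36) and (3.39).
    """
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))

    alpha[0] = pi * B[0]                           # (3.31)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]       # (3.32)

    beta[-1] = 1.0                                 # (3.34)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])     # (3.35)

    P = alpha[-1].sum()   # p(O | lambda); the text assumes a single final state
    L = alpha * beta / P                           # (3.39)
    return alpha, beta, L, P
```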

Once the likelihoods have been calculated for each state and mixture in each training

observation sequence, the maximisation step can proceed to estimate the new param-

eters,

$$ \boldsymbol{\lambda} = [A, B] \tag{3.42} $$
$$ = \left[ a_{ij},\ b_j(o) : 1 \le i, j \le S \right] \tag{3.43} $$
$$ = \left[ a_{ij},\ c_{jm},\ \boldsymbol{\mu}_{jm},\ \Sigma_{jm} : 1 \le i, j \le S,\ 1 \le m \le M_j \right] \tag{3.44} $$

which are individually estimated using standard statistical methods [194]:

$$ a_{ij} = \frac{\sum_{n=1}^{N} \frac{1}{P^n} \sum_{t=1}^{T_n - 1} \alpha^n_i(t)\, a_{ij}\, b_j\!\left(o^n_{t+1}\right) \beta^n_j(t+1)}{\sum_{n=1}^{N} \frac{1}{P^n} \sum_{t=1}^{T_n} \alpha^n_i(t)\, \beta^n_i(t)} \tag{3.45} $$

$$ c_{jm} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_j(t)} \tag{3.46} $$

$$ \boldsymbol{\mu}_{jm} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)\, o^n_t}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)} \tag{3.47} $$

$$ \Sigma_{jm} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t) \left(o^n_t - \boldsymbol{\mu}_{jm}\right)\left(o^n_t - \boldsymbol{\mu}_{jm}\right)'}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)} \tag{3.48} $$

The expectation (3.39 and 3.41) and maximisation (3.45 to 3.48) steps of the Baum-Welch algorithm are then iterated until convergence of the parameters occurs, at which point the algorithm concludes.
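As a compressed sketch of the maximisation step, the following re-estimates single-Gaussian state means and diagonal covariances from the occupancies produced by the forward-backward pass above (so $L^n_{jm}(t)$ reduces to $L^n_j(t)$ with $M = 1$); the full mixture-weight and transition updates of (3.45)-(3.48) follow the same accumulate-and-normalise pattern. This is an illustrative sketch rather than the toolkit implementation used for the experiments in this thesis.

```python
import numpy as np

def reestimate_state_gaussians(sequences, occupancies):
    """Re-estimate single-Gaussian state means and diagonal covariances.

    sequences:   list of (T_n, D) observation arrays O^n
    occupancies: list of (T_n, S) state occupation likelihoods L^n_j(t)
    Returns (means, covs) of shape (S, D), following (3.47)-(3.48).
    """
    num = sum(L.T @ O for O, L in zip(sequences, occupancies))  # (S, D)
    den = sum(L.sum(axis=0) for L in occupancies)               # (S,)
    means = num / den[:, None]

    covs = np.zeros_like(means)
    for O, L in zip(sequences, occupancies):
        for j in range(means.shape[0]):
            diff = O - means[j]
            covs[j] += (L[:, j, None] * diff ** 2).sum(axis=0)
    return means, covs / den[:, None]
```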


Figure 3.2: A diagrammatic representation of a typical left-to-right HMM for speech processing.

3.4.5 HMM types

Whilst it is possible for the states of a HMM to be interconnected in any manner, in general some simplification is performed to make training and decoding of the final HMM simpler. The structure of the HMM is realised entirely by the transition matrix $A = [a_{ij} : 1 \le i, j \le S]$.

The general case is the ergodic HMM, where any state can be reached from any other state in a single step (that is, where $a_{ij} > 0$ for all $i, j$), but for most speech processing tasks the left-to-right HMM has been found to provide a better representation of human speech [153, 194]. In this form of HMM, connections can only be made from a lower to a higher (or the same) state, or

$$ a_{ij} = 0, \quad j < i \tag{3.49} $$

An example of a typical left-to-right HMM is shown in Figure 3.2, and the reason for

the name can be seen when the states are laid out by order of index.

The design of the left-to-right HMM puts a number of restrictions on the training (and decoding) process whilst still adequately modelling the natural non-cyclic nature of speech [105]. In particular, the single entry and exit states and the absence of backward transitions dictated by the left-to-right HMM simplify the possible network paths for EM considerably.
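For example, a three-state left-to-right HMM that allows only self-loops and single-step forward transitions would have a transition matrix of the following form (the probability values are illustrative only):

```python
import numpy as np

# a_ij = 0 for j < i, as in (3.49); each row sums to one
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
assert np.allclose(A.sum(axis=1), 1.0) and np.all(np.tril(A, k=-1) == 0)
```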


3.5 Speaker adaptation

Using the training procedures outlined earlier in this chapter it is possible to train

GMM and HMM-based classifiers that perform well at their respective tasks, pro-

vided that there is enough training data to adequately estimate the parameters of these

models. While this is not generally a problem when training the speaker-independent background models, it is more difficult for speaker-dependent models, which have much less training data available and whose parameters are therefore considerably harder to estimate reliably.

Speaker adaptation is simply the process of adapting the parameters of a previously trained set of models to a new set of observations. The adaptation process is very similar to the EM algorithms used to train models from scratch, outlined earlier in Section 3.3.2 for GMMs and Section 3.4.4 for HMMs, except that the initial parameters are provided by the existing background models.

For this thesis the maximum a posteriori (MAP) method of adaptation will be used

to form the speaker dependent GMM and HMM models. MAP adaptation was cho-

sen over the other major alternative, maximum likelihood linear regression (MLLR)

adaptation, due to the improved performance of MAP on reasonably sized adaptation datasets [98]. This performance benefit was also verified empirically using the XM2VTS speech recognition framework developed in Chapter 4.

3.5.1 MAP adaptation

In the standard EM algorithm, first outlined in Section 3.3.2, the aim at each iteration is to determine a new set of parameters $\boldsymbol{\lambda}'$ given a fixed initial $\boldsymbol{\lambda}$:

$$ \boldsymbol{\lambda}' = \arg\max_{\boldsymbol{\lambda}} p(O \mid \boldsymbol{\lambda}) \tag{3.50} $$


However, for MAP adaptation the initial $\boldsymbol{\lambda}$ is assumed to be a random vector with a certain distribution [98]. Additionally, there is assumed to be a correlation between the parameter vector and the observations that led to it, such that the adaptation observations can be used to form an inference about the final parameter vector $\boldsymbol{\lambda}$. If the prior density of the parameter vector is given as $g(\boldsymbol{\lambda})$ then the MAP parameter estimate can be given by maximising the posterior density $g(\boldsymbol{\lambda} \mid O)$ [98]:

$$ \boldsymbol{\lambda}'_{\text{MAP}} = \arg\max_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda} \mid O) \tag{3.51} $$
$$ = \arg\max_{\boldsymbol{\lambda}} p(O \mid \boldsymbol{\lambda})\, g(\boldsymbol{\lambda}) \tag{3.52} $$

It can be seen that if the prior density is constant (i.e. all parameter vectors are equally likely), the MAP adaptation in (3.52) reduces to the standard maximum likelihood rule shown in (3.50). By substituting the MAP estimate of the parameter vector for the simpler maximum likelihood estimate in the EM algorithms used to train both GMMs and HMMs, MAP adaptation can be performed within the same iterative framework.

MAP adaptation can also put additional restrictions upon which parameters can be

varied, to simplify the adaptation process and ensure that over-fitting to the adaptation data does not occur [194]. Typically for speech applications, and particularly for the

experiments performed within this thesis, only the means of the mixtures are adapted

to the adaptation dataset. Because of this, only the mean adaptation formulas will be

outlined below, and the reader interested in other parameter adaptations should refer

to Lee and Gauvain’s publications on the topic [97, 98].

The parameter estimation of the new mean parameter $\hat{\boldsymbol{\mu}}_{jm}$ for state $j$ and mixture $m$ for MAP adaptation is given as [98, 194]:

$$ \hat{\boldsymbol{\mu}}_{jm} = \frac{P_{jm}}{P_{jm} + \tau}\, \bar{\boldsymbol{\mu}}_{jm} + \frac{\tau}{P_{jm} + \tau}\, \boldsymbol{\mu}_{jm} \tag{3.53} $$


where $\tau$ is a weighting parameter of the a priori model parameters, $P_{jm}$ is the occupation likelihood on the adaptation data, given by

$$ P_{jm} = \sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t) \tag{3.54} $$

$\boldsymbol{\mu}_{jm}$ is the prior mean parameter, and $\bar{\boldsymbol{\mu}}_{jm}$ is the mean of the adaptation data, defined as

$$ \bar{\boldsymbol{\mu}}_{jm} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)\, o^n_t}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} L^n_{jm}(t)} \tag{3.55} $$

It can be seen that if the likelihood of mixture occupation, Pjm, is small then the effect

of the MAP adaptation in (3.53) will be relatively minor, whereas mixtures that are

better represented by the new observations will have the highest change in their mean

parameters [194].
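A minimal sketch of the mean-only MAP update of (3.53)-(3.55) is shown below, assuming the mixture occupation likelihoods on the adaptation data have already been computed (for example via a forward-backward pass); the value of $\tau$ and the array shapes are illustrative.

```python
import numpy as np

def map_adapt_means(prior_means, adapt_obs, adapt_occ, tau=10.0):
    """Mean-only MAP adaptation of one state's GMM, cf. (3.53)-(3.55).

    prior_means: (M, D) speaker-independent means mu_jm
    adapt_obs:   (T, D) adaptation observations o_t
    adapt_occ:   (T, M) mixture occupation likelihoods L_jm(t)
    tau:         weighting of the a priori model parameters
    """
    P = adapt_occ.sum(axis=0)                                   # P_jm, (3.54)
    data_means = (adapt_occ.T @ adapt_obs) / np.maximum(P, 1e-10)[:, None]  # (3.55)
    w = (P / (P + tau))[:, None]
    return w * data_means + (1.0 - w) * prior_means             # (3.53)
```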

3.6 Chapter summary

This chapter has provided a summary of existing techniques for training and adapting the speech models suitable for audio and visual speech processing. An introduction to classifier theory introduced the theoretically optimal Bayes classifier, followed by a summary of non-parametric and parametric methods of estimating classifiers based on training data.

The first of the two main classifier types used in this thesis, the GMM classifier, was then introduced. GMM classifiers represent observations as a collection of multi-variate Gaussian distributions in the feature-representation space. A number of design decisions in representing the parameters of GMMs were discussed, followed by an implementation of the maximum likelihood EM algorithm for estimating these parameters.


The second main classifier introduced in this chapter was the HMM, which is used to

chain together a number of assumed static states in a temporal structure. Each of the

states can then be modelled with a single GMM, allowing the HMM to model both

the static and dynamic nature of human speech. The Viterbi decoding algorithm was

introduced, as well as the Viterbi and Baum-Welch EM-based training algorithms for

determining HMM parameters.

Finally in this chapter, the process of MAP speaker adaptation was presented to demonstrate how speaker-specific models can be trained even when there is comparatively little speaker-specific training data available, by adapting them from speaker-independent models that can be trained over a much larger dataset.


Chapter 4

Speech and Speaker Recognition

Framework

4.1 Introduction

Automatic speech and speaker recognition can be considered to be two highly related

activities, and many methods and techniques are common to both. In this chapter,

these two tasks will be clearly defined and a protocol will be developed to allow for

evaluating performance of both on the same datasets, sharing models and techniques

where appropriate. This protocol will be developed such that it can be used across a

wide range of modelling techniques and features, and will serve as the basis for all of

the future experiments in this thesis.

The chapter will begin with a separate study of the two speech processing tasks, giving

an overview of the methods and techniques that are involved in both tasks. Finally, a

novel framework will be developed based on the XM2VTS database that can be used

to evaluate both speech and speaker recognition performance using the same training

process.


Figure 4.1: A typical speech recognition system, outlining both the training of speech models and testing using these models.

4.2 Speech recognition

Speech recognition is the process of converting human speech into a sequence of

words through a computer algorithm. A broad overview of a typical speech recog-

nition system is shown in Figure 4.1. Before the system can be used to transcribe un-

known speech, models must first be trained to recognise words based upon a training

dataset. The trained models can then be used with a testing dataset to evaluate the

system’s performance. These datasets will typically be subsets of the same database,

and could contain audio or video features, or possibly a fusion of the two.

By aligning the training data and transcriptions, speech models can then be trained for

words or sub-words present in the transcriptions. The choice of speech event model

used for speech recognition is generally dictated by the size of the vocabulary the

system is intended to accurately recognise. Systems that need to recognise a wide

range of words typically are best modelled using sub-words such as phonemes or

triphones, whereas systems that only need to recognise a limited subset of words can

get better performance by modelling each word individually. For this thesis, word

models will be used in all experiments as the vocabulary will be limited to English

digits, as will be discussed later in Section 4.4.


Figure 4.2: Speaker-dependent speech recognition can be impractical for some applications.

4.2.1 Speaker dependency

Speech recognition systems can be designed to work well for all speakers, or they can

be trained for the use of a particular speaker. These two options are generally referred

to as speaker independent (SI) or speaker dependent (SD) speech recognition, respectively.

The benefit of limiting recognition to a single speaker is that performance can gener-

ally be increased by an order of magnitude as the variations between speakers are not

an issue. However, SD systems tend to have scaling problems, as each speaker would

need to have their own set of speech models, which could be prohibitive for many

applications, as illustrated in Figure 4.2. Systems intended for individual use, such as

desktop computer speech recognition software, would be relatively easy to use in a

SD manner, but the users would likely still expect them to work adequately out of the

box, and improve with individual training.

To have adequately trained speech models, there should be many examples of each

word or sub-word in the training set to ensure that each model can be adequately

discriminated during decoding. This can easily amount to a very large amount of

data, especially if the intended vocabulary is large. While SI systems can collect this

over a range of speakers, SD systems obviously have much less training data available

unless the users are very cooperative (again, see Figure 4.2).

To help alleviate data shortage problems with SD speech recognition, the SD speech


models are typically trained through an adaptation process from a SI speech model trained on a set of speakers independent of the intended speaker. This adaptation process consists of attempting to translate the variances existing between and within a set of SI speech models onto a specific speaker to create a set of SD speech models. As the base models for adaptation are speaker independent, they can be trained over a much wider variety of speech events than any SD speech models could, and the adaptation process should keep much of this variety while better modelling the intended speaker's speech.

4.2.2 Speech decoding

The performance of a speech recognition system is evaluated by comparing the es-

timated speech transcription with a known transcription of the speech event. The

methods of comparing the two transcriptions and arriving at a performance measure

differ depending upon how the speech is decoded; systems can be split into those that recognise isolated words and those that recognise continuous speech.

Isolated word speech recognition

Isolated word speech recognition systems are designed and intended, as the name

suggests, to only recognise a single word at a time. Because this model of speech

recognition requires that the word boundaries must be known, a word-segmentation

front-end is required to perform this task before actual speech recognition can occur.

Therefore the sole task of the speech recogniser would be to determine what the words are within the predefined boundaries. By reducing the freedom of the decoder in this manner, an isolated word speech recognition system can commonly outperform a con-

tinuous speech system in controlled conditions. As the word boundaries have already

been defined and presumably match the known transcriptions, isolated word recogni-

tion performance can be measured easily as a percentage of correctly (or incorrectly)

guessed words.


Figure 4.3: An example of a possible voice-dialling speech grammar for continuous speech recognition. Adapted from [194].

The need to determine the word boundaries separately from the speech recognition

task limits the application of isolated word recognition systems in real world situa-

tions. Many word-segmentation algorithms work well on speech with pauses between words, but the performance degrades significantly when recognising natural contin-

uous speech. Continuous speech is more readily recognised using classifiers that can

handle transitions between words automatically.

Continuous speech recognition

Continuous speech recognition systems combine the word-segmentation and word-

classification tasks into a single set of models. Rather than segmenting word bound-

aries before attempting to recognise the words, continuous speech recognition systems

are designed to take a multi-word utterance and attempt to create a transcription out-

lining both the words spoken and the boundaries between them.

Before a continuous speech recognition system can be used, a recognition grammar must

be created. The recognition grammar is a definition of the allowable paths through a

speech network that the speech recogniser can take. Provided that the actual speech

does fit the grammar, this can greatly help the speech recogniser when compared to an


exhaustive dictionary search for every word uttered. An example of such a grammar

for a voice-dialling application is shown in Figure 4.3. Once the grammar has been defined, the speech recogniser's task is to use the speech models to determine the most likely path through a speech network formed by joining the models as defined in the

grammar.

Once the speech recogniser has generated a transcription of the input speech, the per-

formance of the recogniser is evaluated by comparing the output transcription with

a known reference transcription. The two transcriptions are first aligned by performing an optimal string match, without consideration of any actual tim-

ing information in the transcriptions. The two transcriptions can differ through three

main types of errors: insertions, substitutions and deletions, and the transcriptions

are aligned by assigning values to these errors and minimising the sum over the entire

transcriptions. Once the transcriptions are aligned, speech recognition performance

is measured in terms of the differences between the two transcriptions. Typically this

speech recognition performance is expressed in terms of an accuracy of the form [194]:

$$ \text{Accuracy} = \frac{H - I}{N} \times 100\% \tag{4.1} $$

where H is the number of matching words in the two transcriptions, I is the num-

ber of words incorrectly inserted into the estimated transcription (as compared to the

reference) and N is the total number of words in the known transcription. Many pub-

lications alternatively report a word error rate (WER) which is simply calculated as

the opposite of the accuracy:

$$ \text{WER} = (100\% - \text{Accuracy}) = \left(1 - \frac{H - I}{N}\right) \times 100\% \tag{4.2} $$

While either of these metrics can be calculated for a single test sequence, calculating them over a large testing set gives a good idea of the speech recognition system's performance.
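A sketch of how the accuracy of (4.1) and the WER of (4.2) can be computed from an estimated and a reference transcription is shown below, using a standard minimum-edit-distance alignment with equal insertion, deletion and substitution costs; speech toolkits apply the same dynamic-programming alignment, though the exact edit costs may differ.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) between two word lists via minimum-edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # d[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[n][m] / n

# One substitution and one insertion against a four-word reference: WER = 50%
print(word_error_rate("one two three four".split(),
                      "one too three four five".split()))
```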


Figure 4.4: A typical automatic speaker recognition system, outlining both the training of speaker models and testing using these models.

4.3 Speaker recognition

As a research area, speaker recognition covers the use of human speech as a biometric

to identify or verify a speaking person. A broad overview of a typical speaker recogni-

tion system is shown in Figure 4.4. Comparing this system and the speech recognition

system shown in Figure 4.1, it can be seen that there are many similarities between

the two designs. Rather than training models to recognise speech events, for speaker

recognition, models are trained to recognise each speaker based on a training dataset

accompanied by speaker identities for each sequence. This section will cover a broad

overview of speaker recognition systems in comparison to speech recognition systems

as described in Section 4.2.

4.3.1 Text dependency

Speaker recognition can be performed either when the system knows, or can dictate,

what is being said, or alternatively when the speaker is free to say what they like.

These two design choices are referred to as text dependent (TD) and text independent


(TI) speaker recognition respectively. Because of the more limited nature of the task,

TD systems can generally outperform TI, but the limitation of having to know the text

of the utterance puts severe limits on the practical uses of TD speaker recognition. For

example, TD speaker recognition would not be usable in a situation where the speaker

is uncooperative, such as surveillance, where TI recognition would be more suitable.

TD speaker recognition is better suited to controlled circumstances, such as allowing

entry to a secure computer where the user would be fully cooperative.

Ideally TI speaker recognition could be used in all circumstances, and as such it can

be considered the ‘holy grail’ of speaker recognition. However, it does require a wide

variety of speech to train up models that can recognise speakers reliably regardless of

what is being said. To this end, databases used to design TI speaker recognition must

have a similarly large vocabulary to adequately train and evaluate TI speaker recog-

nition models. While this has become less of a problem in audio speaker recognition

research with large speech databases available such as the Wall Street Journal [134] or

TIMIT [79] corpora, audio-visual databases suitable for speaker recognition are much

thinner on the ground, and the only existing one with a large vocabulary is unavail-

able outside of its parent organisation [126]. For this reason, the speaker recognition

research in this thesis will focus on the TD case, using the small vocabulary database

XM2VTS which will be discussed in more detail in Section 2.4.2.

For TI speaker recognition systems, the speaker models are trained to have a single

model for each speaker over the entire vocabulary available, whereas for TD speaker

recognition, the speaker models are trained for a specific speech event. These TD

systems can be further divided into pass-phrase based systems or prompted text sys-

tems, based upon whether the person being recognised says the same phrase every

time or if they are prompted with a different phrase at each use, respectively. As the

prompt can be different for every use, prompted text speaker models are generated

from speaker-dependent word models concatenated together to match the prompted

text. Pass-phrase systems can be modelled using a single speech model for the entire pass-phrase, but more flexibility can be obtained by modelling each word or


sub-word separately in a similar manner to prompted text systems. Such an approach

will form the basis of the TD speaker recognition experiments in this thesis.

4.3.2 Background adaptation

Background adaptation is used in speaker recognition to generate the speaker-dependent models that represent each speaker from a set of background models. The benefit of adaptation over training speaker models directly on each speaker's data is that, by starting with models trained on the large background set, the final speaker models can cope better with a large variety of speech than models trained on the speaker alone. Additionally, there may simply not be enough data for training directly on a specific speaker, whereas adaptation can provide good performance with a limited set of speaker-specific data.

Generally, two main types of differences arise between speech events relevant to speaker recognition, broadly summarised as between-speaker and within-speaker differences. The between-speaker differences are obviously more important to recognising individual speakers, but the models must also be robust to within-speaker differences, which are generally related to varying speech events. As it is trained over a wide variety of speakers and speech, the background model can serve as a baseline for averaging out the within-speaker differences, and the speaker-specific characteristics can then be incorporated through adaptation to each speaker to form the final speaker models.

The adaptation of the speaker models is very similar to the adaptation of SD speech

models. Background models are trained for each word, and then individually adapted

based on occurrences of the word within the training speech for a particular speaker.

While ideally the set of speakers used to train the background models should be sep-

arate from the speakers intended to be recognised by the system, it may be difficult to

achieve this due to the large amount of data required. As a compromise when there

is a limited amount of data available, as in audio-visual speech, the background mod-


els are commonly trained over all intended speakers and then adapted to the specific

speaker.

4.3.3 Evaluating speaker recognition performance

Speaker recognition systems can be subdivided into speaker identification or speaker verification. Identification systems are designed to choose the identity of a speaker from a list of known speakers, possibly including the choice of ‘none of the above’. Alternatively, in a verification system, the speaker claims a particular identity in some manner

and the system must decide whether to accept or reject the speaker’s claim. This sec-

tion will outline these two models of speaker recognition and how performance is

evaluated in both.

Speaker identification

In a speaker identification system, the speech utterance is compared against all speaker

models available to the system. The scores returned by each of these models are then

used to rank the speakers according to which is most likely to have spoken the ut-

terance. Speaker identification can either be closed-set or open-set, depending upon

whether the possibility exists that an unknown speaker may be presented to the system.

For open-set speaker identification, for which this possibility does exist, an additional

background model may be required to serve as a threshold to catch out-of-set speak-

ers.

The output of a speaker identification system is typically provided in the form of a

top N list of most-likely speakers for a given utterance. Over a large number of test

utterances, the speaker identification performance can be measured as the number of

times the correct speaker appeared within the top N.

One of the major drawbacks of speaker identification is that every utterance must


be tested against all possible speaker models. While this may be practical in some

circumstances, such as spotting a small number of suspicious people in surveillance video, the time taken to test each model can be a limitation in other circumstances.

Speaker verification

Instead of having to choose a speaker’s identity as in speaker identification, speaker

verification only requires the system to verify a speaker’s claimed identity. To verify

the speaker’s claim, only one speaker model is consulted (i.e. the claimed speaker) as

compared to all speaker models for an identification task.

Speaker verification systems are typically designed to generate a score that represents

the likelihood of the claimed speaker being the same as the speaker who produced the

utterance. This verification score is calculated using the claimed speaker’s models,

from which the background speaker model’s score is then subtracted to normalise the

verification score for the length of the utterance and environmental factors. Finally

the normalised score can then be compared against a threshold to decide whether to

accept or reject the speaker’s claim.
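In log-likelihood terms, this scoring procedure amounts to the sketch below: the claimed speaker's model score is normalised by the background model score and by the utterance length before being compared to a threshold. The numbers and the threshold are purely illustrative; in practice the threshold would be tuned on evaluation data.

```python
def verification_score(log_p_claimed, log_p_background, num_frames):
    """Length-normalised log-likelihood ratio for one verification trial."""
    return (log_p_claimed - log_p_background) / num_frames

# Accept the identity claim if the normalised score exceeds the tuned threshold
accept = verification_score(-4210.5, -4305.2, 350) > 0.1
```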

To evaluate verification systems completely, both speakers who correctly claim their

identity and speakers claiming another identity, referred to as clients and impostors, are

used to test the system. This evaluation is based upon the rate of verification errors,

of which there are two types: misses, and false alarms. Misses occur when a client is

incorrectly rejected, and false alarms occur when impostors are incorrectly accepted.

These two types of errors can be considered to be in opposition, and lowering one will

cause the other to rise.

The process of trading off these two errors comes about when choosing the accept/reject decision threshold. If the threshold is low, few misses will occur, but there will be many false alarms. Correspondingly, a high threshold will cause more misses and fewer false alarms. These choices can be illustrated through the use of detection error


trade-off (DET) [112] plots, showing the two error rates at each possible operating point of a system. DET plots, of which an example is shown in Figure 4.5, can be used to succinctly illustrate the relative performance of a number of verification systems on a single plot.

Figure 4.5: An example of a DET plot comparing two systems for speaker verification, plotting miss probability against false alarm probability (both in %).

As the axes of a DET plot represent errors in the verification system, the best performance is obtained as the results move towards the bottom left of the plot. The dashed

line represents the point at which the false alarms and misses are equal, which is re-

ferred to as the equal error rate (EER). This point can serve as one possible summary

of a DET curve when multiple systems are compared.
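The EER can be estimated from a set of client and impostor scores by sweeping the decision threshold and finding the operating point where the two error rates meet, as sketched below on illustrative, randomly generated scores.

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: the point where miss and false-alarm rates are equal."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        miss = np.mean(client_scores < th)            # clients incorrectly rejected
        false_alarm = np.mean(impostor_scores >= th)  # impostors incorrectly accepted
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer

rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(1.0, 0.5, 1000),    # illustrative client scores
                       rng.normal(0.0, 0.5, 1000)))   # illustrative impostor scores
```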

For speaker recognition research, speaker verification is generally considered more

useful for two main reasons:

• Speaker verification requires fewer models to test (only the claimed speaker and background)


• DET plots provide a more detailed look at performance than a single top N

speaker identification percentage

For real-world applications, speaker verification is limited by the need to have an identity claim. This identity claim may consist of a PIN number or identity card, but both would obviously require the cooperation of the speaker. For this reason verification would be more suited to security access applications, rather than surveillance, where identification would be more suitable (provided that only a limited number of identities are being considered).

4.4 Speech processing framework†

For the experimental work in this thesis, a comprehensive framework was required to

meet the following criteria:

1. Can be used for both SD and SI speech recognition experiments,

2. Can be used for TD speaker verification experiments, and

3. Allows models to be re-used where possible

By examining the modelling requirements of both speech and speaker recognition pre-

sented earlier in this chapter, it can be seen that many of the requirements are common

to both methods. The same models used for SI speech recognition can also serve as

the background models for speaker adaptation and for performing score normalisa-

tion in speaker verification. In a similar manner, the adapted speech models used

for SD speech recognition can also serve as the TD speaker verification models. An

overview of the speech processing framework used in this thesis, which combines the

two speech processing tasks in this manner, is shown in Figure 4.6. After examining

the data segmentation requirements of this framework, each of the sections of this

framework will be examined in more detail in the following subsections.


Figure 4.6: Overview of the speech processing framework used in this thesis.

4.4.1 Training and testing datasets

For the experimental work performed in this thesis, this framework will be imple-

mented on the digits section of the XM2VTS database (see Section 2.4.2). It was de-

cided not to use the phonetically balanced phrase (“Joe took father's green shoe bench out”) as each word in that phrase appears only half as often as each of the dig-

its. As a result, this framework is based on two shots of two 10-digit strings for each

of the 4 sessions in the XM2VTS database, for a total of 4720 (2× 2× 4× 295) repeats

of each digit over the entire database.

As can be seen from Figure 4.6, implementing this framework requires that the database


be divided into five datasets:

• client training,

• client evaluation,

• client testing,

• impostor evaluation, and

• impostor testing

For this framework it was decided to use the same dataset for background training

and adaptation, due to the relatively small number of speakers available compared to audio-only speech processing databases. Additionally, this configuration

allowed the framework to stay as close to the existing XM2VTS protocol [107] as pos-

sible, which did not define a background set.

While the evaluation and testing datasets are separate within the database, they are

both treated exactly the same within the framework. The evaluation datasets are used

to test and tune the speech processing algorithms and to enable parameters of the

modelling and recognition to be estimated. Once these parameters have been deter-

mined, the speech processing algorithms can be re-run on the testing dataset to report

the final speech or speaker recognition performance.

The first split of the database for this protocol was over the speakers to create the

client and impostor sets. This split was performed in the same manner as the existing

XM2VTS protocol [107], with 200 client and 95 impostor speakers. Client speakers

are used in the framework to train the background models, as well as to adapt the speaker models. For testing and evaluation, these speakers test both SD speech recognition and serve as the clients for speaker verification. The impostor speakers are not

involved in training at all, but are used solely in testing both SI speech recognition and

challenging speaker verification with unknown speakers.


Table 4.1: Configurations of the XM2VTS clients possible under this framework.

As the client speakers are used for both training and testing/evaluation, a further split

had to be made over the XM2VTS sessions defining which sessions are to be used to

train the models and which are used for testing and evaluation. For this framework,

the XM2VTS protocol’s Configuration II (Figure 2.4(b)) was chosen as the basis, with

two sessions for the client training and one session each for evaluation and testing.

However, to allow for a larger number of experiments to be run than is possible under

the XM2VTS protocol, 12 configurations of 2-train/1-test/1-evaluation were defined,

shown in Table 4.1. As can be seen in the table, this configuration results in 6 distinct

training partitions, for which the evaluation and testing partitions can be swapped to

result in the 12 partitions. While it is not necessary to run all 12 configurations within

the framework, the more configurations that are used allows for more speech process-

ing experiments, and greater confidence in the performance measures reported.

Before the XM2VTS database could be used to train up speech-based models, a tran-

scription of the database had to be obtained. While the database was not supplied with a time-aligned transcription, the textual content of each shot was clearly indicated. By

using external speech models trained up on the large Wall Street Journal [134] audio

speech corpus to estimate the boundaries of the known transcriptions, a good esti-

mate of the time-aligned speech transcriptions was obtained. These transcriptions

were used both to train and test the speech models within this framework.

4.4.2 Background training

Within this framework, the background speech models are trained from the client

training dataset. These background speech models are used for the SI speech recogni-


tion, and in speaker verification to normalise the claimed model scores. Additionally, these

background models serve as the base models for adaptation of the speaker-dependent

speech models.

As the vocabulary of this dataset is very small, with only the 10 English digits in

use, no benefit was found in modelling below the word level. Accordingly, it was

decided to form the backgroundmodels over entire words, resulting in 10 background

word models and 1 background silence model. By synchronising the client training

transcriptions with the audio, video or fused features intended for modelling in the

client dataset, the features corresponding to each word can be separated and combined over all client speakers to form each of the background models.

4.4.3 Speaker adaptation

The adaptation of the speaker models is also performed on the client training dataset.

These adapted speech models are used for the SD speech recognition, and also to form

the speaker models for speaker verification.

Speaker adaptation is performed in a very similar manner to background training.

Instead of training models over the entire client speaker set, adaptation is performed

by taking a particular background model and adapting it to all of the matching speech

events for one particular speaker. This adaptation process is performed for each of the

background models and for each speaker. The result of the speaker adaptation is the

11 speech models (10 digits and silence) having an adapted form for each of the 200

client speakers within the framework.

4.4.4 Speech recognition

Both SD and SI speech recognition can be accomplished, based upon the choice of

models and testing data used to perform the speech recognition experiments.

Figure 4.7: Word recognition grammar used in this framework.

SI speech recognition is performed by testing the background models against all utter-

ances by the ’impostor’ speakers. SD speech recognition is performed by testing each

speaker’s adapted speech models against their utterances in the client evaluation or

testing dataset, and the final performance is reported over all client speakers. The

specific details of how HMM-based modelling techniques will be used to recognise

speech will be covered in more detail in later chapters.

Other than the choice of models and datasets, both SI and SD speech recognition are tested in an

identical manner, through the evaluation of continuous speech recognition. Because

the vocabulary tested is very small (English digits) the grammar used for recognition

is a simple word loop with silences, as shown in Figure 4.7. Once the estimated tran-

scriptions have been generated, performance is calculated as a WER by comparison

against the actual testing transcriptions supplied with the database.

4.4.5 Speaker verification

For this thesis, it was decided to concentrate on speaker verification, as it requires fewer comparisons between speaker models and provides finer-grained differentiation between verification systems than the single rank-N correct percentage obtained from speaker identification experiments.

To get a reasonable idea of the speaker verification performance, each client’s models

are compared against a number of impostor shots as well as the matching client shots.


For each of the 400 client shots in the client testing (or evaluation) dataset, 20 im-

postor shots are randomly selected from the impostor dataset and used to attempt to

gain entrance while claiming to be the client. Because every speaker in the XM2VTS

database speaks the exact same phrase, there were no issues with finding identical

client/impostor phrases to test the speaker verification system.

TD speaker verification can be performed by choosing the appropriate speaker adapted

models. As the phrase spoken by the clients or impostors was known and identical

over all speakers, the speaker recognition model was formed by concatenating the

claimed client word models together to form the known phrase spoken. The back-

ground speech models used for normalisation are similarly formed by concatenating

the background speech models. Speaker verification performance will be presented

as DET plots under this framework, although they may be simplified to a single EER

measure when space considerations dictate.
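To make the relationship between a DET curve and a single EER value concrete, the following minimal sketch (plain NumPy, with synthetic client and impostor scores standing in for real verification output; not the implementation used in this thesis) sweeps a decision threshold and reads off the operating point where the false acceptance and false rejection rates are equal.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Return the EER given genuine (client) and impostor score arrays,
    where higher scores indicate a stronger claim of being the client."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Example with synthetic scores: 400 client trials and 20 impostor trials per
# client shot (8000 in total), with clients scoring higher on average.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(1.0, 1.0, 400), rng.normal(-1.0, 1.0, 8000)))
```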

4.5 Acoustic and visual conditions

In addition to conducting the speech processing experiments in clean conditions, the

acoustic modality was also corrupted with a source of background office babble [176]

at signal-to-noise ratios (SNRs) of 0, 6, 12 and 18 decibels (dB) to investigate the

robustness of the recognition experiments to train/test mismatch. Visual degradation

was also considered for this thesis, but was not included due to time constraints and

the difficulties of simulating real-world sources of visual degradation [70].

All training and adaptation was performed using the clean data, while final evaluation of the speech and speaker recognition systems can be conducted over both the clean and noisy conditions. Finally, when the speech processing experiments are considered in noisy conditions, only the noisier conditions are presented, as the 18 dB SNR acoustic

data was largely indistinguishable from the clean data.


4.6 Chapter summary

This chapter has provided a summary of both speech and speaker recognition in

audio-visual environments. A novel framework was presented that can be used to

perform both speech and speaker recognition on the XM2VTS database.

By examining both speech and speaker recognition in detail, this chapter has shown

that many of the models and techniques are common to both recognition paradigms.

Taking advantage of these commonalities, the framework developed in this chapter

can be used to test both speech and speaker recognition using a single training pro-

cess. By having a single set of models that can be used both to recognise speech and

speakers, the similarities and differences between the two speech processing tasks can

be examined throughout this thesis without having to be concerned about differences

in training for either task.


Chapter 5

Feature Extraction

5.1 Introduction

The aim of feature extraction is to convert the raw observations into a concise set of

features suitable for classification. In ideal circumstances, the feature extraction should

be able to divide the observations into distinct, non-overlapping regions in a multi-

dimensional feature space for each class under consideration, such that the job of the

classifier is trivial. However, this does not typically happen in real-world scenarios,

so the aim of feature extraction is generally to reduce the number of features used for

classification, whilst still maintaining good separation between classes, and providing

some measure of invariance to changes in observations within the groups chosen for

classification. This chapter will conduct a review of the existing literature in audio

and visual feature extraction, with particular focus on the extraction of visual speech

features.

Recently, video features designed to emphasise the temporal nature of human speech have been developed and have shown much better performance than static features for

speech recognition. This chapter will outline a particular implementation using a cas-

cade of appearance-based feature extraction techniques to form a dynamic represen-


tation that has been shown to work well for speech recognition applications.

However, such features have not been examined in detail for speaker recognition to

date, and as the goals of the speech and speaker recognition applications under the

speech processing banner are quite distinct, it stands to reason that the features best

suited to each task may not match. To this end, a novel study will be presented of

various video feature representations for both speech recognition and speaker iden-

tification. The models and the performance obtained using them will also serve as a

baseline for future experiments in this thesis.

5.2 Acoustic feature extraction

5.2.1 Introduction

Extraction of suitable feature vectors for speech processing applications from acoustic

signals is a very mature area of research [34, 132, 157, 154] and, as such, will not be

covered in any great detail in this thesis. This section will cover a brief overview of the main concepts of acoustic feature extraction, followed by a similarly brief introduction to MFCC and PLP-based acoustic feature extraction techniques, which will both be examined as acoustic features experimentally throughout this thesis. All acoustic feature extraction for this thesis was performed with the HTK Toolkit, and more information

on this topic is available in Chapter 5 of The HTK Book [194].

Like all feature extraction techniques, the aim of acoustic feature extraction is to form

a concise representation of the relevant features, while providing invariance to irrel-

evant features of the input acoustic signal. The relevancy of these features is eval-

uated in terms of the intended application of the classifier. For example, a speaker-

independent speech recognition application would be interested in changes relevant

to differing words, and unconcerned about changes due to differing speakers, but for

a speaker identification application the opposite may apply.


The process of acoustic feature extraction can be divided into pre-processing, filter-

bank analysis, and the extraction of features from the accumulated filter banks. These

features can then also be augmented with energy and time derivative features. The

details of these processes will be covered in the following subsections.

5.2.2 Pre-processing

Acoustic speech is a naturally varying continuous signal with the characteristics of

the signal varying considerably over time. To be intelligible, the raw speech signal

is generally recorded at a sampling rate of at least 8 kHz, the standard for telephone

speech. However, even at the low quality of telephone speech, the (comparatively

low) sampling rate employed still results in too many features for most classifiers to

reliably handle. Fortunately, the variance in the speech signal can be considered slow

enough such that its statistics can be considered quasi-stationary over segments of up

to 100 milliseconds [64], and most acoustic feature extraction occurs over windows of

the acoustic signal of up to that length.

For the experiments conducted in this thesis a Hamming window function is used to

divide the incoming acoustic speech signals into 25 millisecond windows every 10 milliseconds, resulting in 100 speech feature vectors extracted for every second of speech. Before windowing, a pre-emphasis function was used to flatten the frequency

characteristics of the speech signal, to compensate for the tendency of speech to have

most of its energy in the low frequencies [64, 194].
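As an illustration of this pre-processing stage, the sketch below (NumPy; the pre-emphasis coefficient of 0.97 is an assumed value, as the thesis does not state the setting used) applies a first-order pre-emphasis filter and then slices the signal into 25 ms Hamming windows every 10 ms.

```python
import numpy as np

def preemphasise(signal, alpha=0.97):
    """First-order pre-emphasis: y[t] = x[t] - alpha * x[t-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate, win_ms=25, step_ms=10):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    win_len = int(sample_rate * win_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win_len) // step)
    window = np.hamming(win_len)
    frames = np.stack([signal[i * step:i * step + win_len] * window
                       for i in range(n_frames)])
    return frames  # shape (n_frames, win_len); roughly 100 frames per second

# Example: one second of 8 kHz audio yields 98 full 25 ms frames.
frames = frame_signal(preemphasise(np.zeros(8000)), 8000)
```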

5.2.3 Filter bank analysis

Once the pre-processing has divided the incoming speech signal into quasi-stationary

windows, the frequency spectrum of the speech signal within each time-window is

examined to generate the final speech features. The two most commonly chosen tech-

niques of spectrum analysis are


1. linear prediction analysis and

2. perception-based filter bank analysis.

Linear prediction analysis is based on modelling the vocal tract with an all-pole model, whereas filter bank analysis applies a human-perception-based filter bank to the

power spectrum of the signal. For this thesis, the filter bank analysis technique was

chosen, as such features can be calculated more easily [194], whilst still performing

extremely well for speech processing tasks when compared to features derived from

linear prediction analysis [34, 157].

5.2.4 Mel frequency Cepstral coefficients

Filter bank analysis is based upon studies on the perception of speech, showing that

the human ear resolves frequencies non-linearly across the speech spectrum [194].

This non-linear behaviour can be approximated by a triangular filter bank spaced

across the spectrum according to the human-perception based Mel scale [153, 194],

defined by

\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (5.1)

Once the filter bank has been calculated, the incoming speech window is transformed

into the frequency domain using a fast Fourier transform (FFT) and the magnitudes in

the frequency domain are binned according to the value of each filter to arrive at N weighted sums, [m_i : 1 ≤ i ≤ N], for each window. Finally, Mel-frequency cepstral

coefficients (MFCC) can be calculated by taking the discrete cosine transform (DCT)

of the log of those accumulated amplitudes [194],

c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left[\frac{(2j-1)\, i\pi}{2N}\right] \qquad (5.2)
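As a concrete reading of (5.1) and (5.2), the sketch below converts a vector of log Mel filter-bank amplitudes into cepstral coefficients; the construction of the triangular filter bank itself is omitted, and the choice of 12 retained coefficients is illustrative rather than taken from this chapter.

```python
import numpy as np

def mel(f):
    """Mel scale of (5.1)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_from_log_filterbank(log_m, n_ceps=12):
    """Cepstral coefficients per (5.2), applied to the log of the N
    accumulated Mel filter-bank amplitudes m_1..m_N."""
    N = len(log_m)
    j = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / N) *
                     np.sum(log_m * np.cos(np.pi * i * (2 * j - 1) / (2 * N)))
                     for i in range(1, n_ceps + 1)])

# Example: 26 log filter-bank amplitudes -> 12 cepstral coefficients.
print(mfcc_from_log_filterbank(np.log(np.random.rand(26) + 1.0)).shape)
```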


5.2.5 Perceptual linear prediction

PLP features [72] are an alternative to MFCC-based feature extraction that are also

popular for acoustic speech processing tasks [157, 73]. As suggested by its name,

this method of acoustic feature extraction can be seen as an approach combining both

linear prediction analysis and the perception-based filter banks.

As implemented by the HTK Toolkit [194], and used in this thesis, PLP-based fea-

tures are calculated based on the same Mel-scale filter bank as used for MFCC feature

extraction. The Mel filter-bank coefficients, [m_i : 1 ≤ i ≤ N], are first weighted by an equal loudness curve and compressed by taking the cubic root. The resulting modified acoustic spectrum is then converted to cepstral coefficients in an identical manner to

linear prediction analysis [194].

5.2.6 Energy and time derivative features

In addition to the spectrum-based coefficients, a number of other components can be added to each feature vector to improve the speech processing performance. The two

main types of additional features are an energy term and features calculated from time

derivatives.

The energy term is used to augment the spectral features, and is computed as the log

of the signal energy. That is, for speech samples [st : 1≤ t ≤ T] corresponding to a

particular audio window,

E = \log \sum_{t=1}^{T} s_t^2 \qquad (5.3)

Various normalisation and adjustment techniques can be applied to this energy term,

which for this thesis, are left at the default settings of the HTK Toolkit [194].

Figure 5.1: Configuration of an acoustic feature vector including the static (c_i) and energy (E) coefficients and their corresponding delta and acceleration coefficients.

Time derivative features are used to allow the classifiers to have limited knowledge of the dynamic changes in the acoustic features, rather than just their values in each

particular window. Typically, both first and second order derivatives, referred to as deltas and accelerations respectively, are added. These time derivative features are calculated by

d_{i,t} = \frac{\sum_{\theta=1}^{\Theta} \theta \left( c_{i,t+\theta} - c_{i,t-\theta} \right)}{2 \sum_{\theta=1}^{\Theta} \theta^2} \qquad (5.4)

where d_{i,t} is the delta coefficient at time t of static feature c_i, calculated in terms of the surrounding static coefficients (including the energy term) [c_{i,t−Θ}, . . . , c_{i,t+Θ}]. The

value of Θ is the window size for calculating the derivatives, which is typically set

to 2. The corresponding acceleration coefficients can be formed by using (5.4) on the

delta features instead of the static. The final feature vector for each window is then

the concatenation of the static spectral coefficients and energy with the deltas and

acceleration features as shown in Figure 5.1.
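The sketch below (NumPy) computes the log-energy term of (5.3) and applies (5.4) to obtain deltas and, from those, accelerations, concatenating everything into the layout of Figure 5.1; the edge padding used at utterance boundaries is an assumption, as the thesis leaves such details to the HTK defaults.

```python
import numpy as np

def log_energy(frame):
    """Log signal energy of (5.3) for one windowed frame."""
    return np.log(np.sum(frame ** 2))

def deltas(static, theta=2):
    """Delta coefficients of (5.4); static has shape (n_frames, n_features)."""
    padded = np.pad(static, ((theta, theta), (0, 0)), mode='edge')
    num = sum(th * (padded[theta + th:len(static) + theta + th] -
                    padded[theta - th:len(static) + theta - th])
              for th in range(1, theta + 1))
    return num / (2 * sum(th ** 2 for th in range(1, theta + 1)))

def full_feature_vector(static_with_energy):
    """Concatenate statics, deltas and accelerations (Figure 5.1 layout)."""
    d = deltas(static_with_energy)
    a = deltas(d)  # accelerations: (5.4) applied to the deltas
    return np.hstack([static_with_energy, d, a])

# Example: 13 static features (12 cepstra + energy) -> 39-dimensional vectors.
print(full_feature_vector(np.random.rand(100, 13)).shape)  # (100, 39)
```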

5.3 Visual front-end

While an acoustic speech signal can be simply represented as amplitude values at the

sampling rate of the recording device, the raw visual speech signal consists of an entire

image for every video-frame collected by the camera. Although video frame-rates are orders of magnitude lower than acoustic sampling rates, the large amount of

information in each frame still presents problems for the extraction of visual speech

features.

The first major problem with extracting visual speech features is that in many applica-

tions, much of every video frame is unrelated to the visual speech.

Figure 5.2: The visual feature extraction process, highlighting the visual front end, encompassing the localisation, tracking and normalisation of the lip ROI.

It is widely agreed that the majority of visual speech information emanates from the region around the

speaker’s mouth [95], and therefore the extraction of visual features should be primar-

ily based upon this area. The task of the visual front-end is to locate, track and normalise

this ROI over an entire video, before the visual speech feature extraction stage can oc-

cur, as shown in Figure 5.2. Some researchers also include the feature extraction step

within the visual front-end [145], but for this thesis the front-end will refer solely to

the location and normalisation of the ROI.

5.3.1 The front-end effect

Clearly the accuracy of the visual front-end will have a large effect on the final accu-

racy of the speech processing system. If the tracking of the ROI is poor, then the per-

formance of the final classifiers using the tracked features will be unreliable as they

will not always be evaluated on a consistent ROI. This is referred to as the front-end

effect, and can be expressed simply as

p(c) = p(c \mid f)\, p(f) \qquad (5.5)

where c represents the speech or speaker classifier working correctly, and f represents the front-end working correctly. It can be seen that an ideal front-end (p(f) = 1) would allow effort to be focused solely on improving the classifier performance.


5.3.2 A brief review of visual front-ends

One simple way to make p ( f ) approach 1 is to manually label the ROI [54, 86], which

is the approach taken in this thesis, as the classifier performance is of primary interest.

In a similar vein, many early AVSP applications [5, 159] made use of the DAVID [26]

database, which had blue-highlighted lips to allow a nearly trivial front-end imple-

mentation.

While manual labelling can be useful on previously captured data in research set-

tings, it is unrealistic for real world circumstances, where most speech processing

applications should work unsupervised. Automatic mouth ROI detection methods

vary considerably in the literature, but most approaches start with a broad face de-

tector and move to specific facial feature detectors to finally locate and normalise the

mouth ROI [146]. Many of the methods and techniques are shared between the face

and facial feature detectors, but there exists no single method that works well in all

circumstances.

In a survey of over 150 publications in face and facial feature detection, Yang et al. [192]

established that four broad groups of facial detection algorithms could be defined:

1. Knowledge based methods, encoding human knowledge of what makes a face.

Generally these methods are based upon the spatial relationships between facial

features.

2. Feature invariant approaches, where a bottom up approach is used to attempt

to generate features that are robust to the conditions in which the video has been

collected. Some examples are texture and chromatic based features.

3. Template matching methods, where a number of face or facial-feature templates

are stored and compared against test images to track faces.

4. Appearance based methods, where a model is generated on a set of training

face or facial features to adequately capture the variability of facial appearance.

Many different modelling techniques have been used for this purpose, including neural networks, HMMs and support vector machines.

Figure 5.3: Manual tracking was performed by recording the eye and lip locations every 50 frames and interpolating between.

Whilst there is little consensus on the best method of achieving a well-performing visual front-end, one appearance-based method that has been developed recently and shown good performance in implementing an AVSP visual front-end is the Viola-Jones

algorithm [179]. The Viola-Jones algorithm is a generic object detection algorithm

formed by a cascading chain of very simple features derived from intensity values

in the region being searched. A thorough review of this algorithm as a front-end for

both frontal and profile visual speech is presented by Lucey (2007) [102].

5.3.3 Manual front-end implementation

As this thesis is not concerned with the performance of the visual front-end, a manu-

ally tracked visual front-end approach was chosen to allow p ( f ) in (5.5) to approach

unity, and allow the thesis to focus on improving the classifier performance under the

assumption of a well-performing front-end.

To allow the location of the mouth ROI to be known for every frame in the database, a

volunteer was recruited to track the locations of the eyes and mouth in every 50th

frame, or approximately every 2 seconds, of each video in the XM2VTS database.

Some examples of the manually tracked frames are shown in Figure 5.3. The locations of these points on the intermediate frames were then interpolated from these landmark areas.

Figure 5.4: Some examples of the original and grey-scaled resized ROIs extracted from the XM2VTS database.

Once the locations of the eyes and mouth were determined for a particular frame,

the mouth image was chosen as a 120× 80 pixel region centred on the mouth region

and with the long side parallel to a line drawn between the eyes. This image was

then gray-scaled and down-sampled to a 24× 16 region to reduce the number of raw

pixels for the feature extraction stage. The down-sampling and gray-scaling are not

expected to affect lipreading performance based on work by Jordan and Sergeant [82]

and Potamianos et al [149]. Some examples of lip ROIs generated from this manual

tracking process are shown in Figure 5.4.
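A minimal sketch of how such an ROI could be cut out, assuming OpenCV is available and per-frame eye and mouth coordinates have already been obtained; the exact rotation and cropping steps shown here are illustrative rather than the procedure used to build the thesis data.

```python
import numpy as np
import cv2  # OpenCV; an assumption, not necessarily the tool used in the thesis

def extract_mouth_roi(frame, left_eye, right_eye, mouth,
                      size=(120, 80), out_size=(24, 16)):
    """Cut a 120x80 mouth region aligned with the eye line, then grayscale
    and down-sample it to 24x16, roughly following Section 5.3.3.
    frame is assumed to be a BGR colour image."""
    # Rotate the frame about the mouth centre so the eye line is horizontal.
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))
    rot = cv2.getRotationMatrix2D(tuple(map(float, mouth)), angle, 1.0)
    rotated = cv2.warpAffine(frame, rot, (frame.shape[1], frame.shape[0]))

    # Crop the fixed-size region centred on the mouth.
    w, h = size
    x, y = int(mouth[0] - w / 2), int(mouth[1] - h / 2)
    roi = rotated[y:y + h, x:x + w]

    # Grayscale and down-sample for the feature extraction stage.
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, out_size)  # 24x16 grayscale ROI
```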

5.4 Visual features

Once the relevant ROI has been located, feature-extraction must be performed on the

region before classification can occur. It is widely agreed in the literature [146, 25] that

extraction of visual speech features can be divided into three main categories:

1. Appearance based,

2. Contour based, and


3. A combination of both appearance and contour

This section will summarise the techniques that have been developed for extraction

of visual speech features under these categories, noting the advantages and disadvan-

tages of each method. Finally a comparison of the three methods will be made.

5.4.1 Appearance based

Appearance-based methods are designed to take all the information available in the ROI and produce a single feature vector based on the appearance of the ROI. The

appearance-based methods generally are not designed to make any assumptions of

which components of the ROI are important for speech, and allow features to be ex-

tracted from the entire ROI and not just the lip movements. Some of these additional

visual indicators that can be relevant to visual speech are the visibility of the tongue

and teeth, as well as any visible muscle or jaw movements [172].

While the size and shape of the ROI for appearance-based features is typically a square

or rectangular region centered on the mouth region of the face [146], this does not

have to be the case. Illustrating the extremes of these approaches, some researchers

have extracted features from the entire face region [116], whereas others have used

disc-shaped regions around the lips to limit the amount of non-lip pixels being ex-

tracted [47]. Indeed, some researchers have even used temporal information to form a

three dimensional ROI [99, 141] from which the feature extraction is performed.

Within the extracted ROI the pixel values, either colour [28] or grayscale [49, 54], are

typically concatenated to form a monolithic feature vector. However, frame-to-frame

differences [164] or optical flow analysis [67, 19] have also been used to form alterna-

tive feature vectors with more of a focus on the dynamic information available in the

ROI. While some approaches have used such feature vectors directly [15, 47], the sheer

number of pixels, and therefore feature dimensions, in any reasonable sized ROI can

overwhelm the parameter estimation of many classification techniques, in an effect


commonly termed the curse of dimensionality [10].

It is therefore a primary goal of visual speech feature extraction that the number of

features available to the classifier stage can be reduced whilst still maintaining the

discriminative power of the remaining features. This is a similar goal to that required

in face recognition, and therefore many visual speech feature extraction techniques

mirror their earlier counterparts in face recognition research. The earliest example

of this propagation from face recognition to AVSP was Bregler and Konig’s work on

eigenlips [16], which was closely based on Turk and Pentland’s groundbreaking face

recognition paper introducing the principal component analysis (PCA) based feature

extraction technique eigenfaces [175]. Since its introduction, PCA-based feature extraction has become one of the most common feature extraction techniques used in AVSP

research [47, 99, 67, 86, 106, 146].

However, one of the problems with the PCA approach is that a corpus of represen-

tative eigen-ROIs must first be generated before an unseen ROI can be projected into

the eigen-ROI feature space. By comparison a number of linear image transformation

techniques such as discrete wavelet transforms (DWTs) [45, 140, 142] and discrete co-

sine transforms (DCTs) [46, 47, 5, 126, 164, 71, 54] have been used to remove redundant

information in the ROI. The higher energy components can then be extracted from the

transformed image to form a compact feature representation. In particular DCT fea-

ture extraction has been shown to perform as well as PCA techniques for most AVSP

tasks with the benefit of not requiring training to establish the feature-space [67, 147].

The feature-reduction methods presented above have been shown to perform well

for AVSP tasks. However, these methods simply produce a compact representation

of the entire ROI, and are blind to the classification of the reduced features. While

these algorithms have shown good ability to discriminate between speech or speaker

classes, they were not designed with such a discriminative ability in mind.

To create a better feature representation, the linear discriminant analysis (LDA) algo-

rithm [48] can be used to first map visual data to specific classes, and well-discriminating features can be extracted with the mapped classes in mind.

Figure 5.5: Contour-based feature extraction uses the geometry of the lip region as the basis of feature extraction.

LDA was first proposed

for such an extraction of visual speech features by Duchnowski et al. [47], where the

LDA step was performed on the raw pixels in the ROI. However, the LDA transfor-

mation matrix can easily get too large to calculate using this approach, and is com-

monly performed as a stage in a cascade beginning with an earlier PCA [100, 57] or

DCT [145, 21] feature extraction stage. Indeed, Potamianos et al. [144] found that an

additional maximum likelihood linear transform (MLLT) stage after the LDA

improved speech-reading performance further still.

5.4.2 Contour based

The choice of contour- or geometry-based visual speech features over appearance-

based comes from a desire to represent the visual speech directly as the position of the

visual articulators during the speech event. Because the articulators are represented

based on their position relative to the face, contour-based features require that the positions of the articulators are first located within the wider ROI, before the fea-

tures can be extracted. For contour-based feature extraction, only the positions of a

relatively small number of points are desired, compared to the large number of pix-

els in the ROI for appearance-based methods. The combination of a minimal feature set and features recorded directly from the positions of the visible articulators

should provide good performance for AVSP tasks. However, contour extraction does

require further localisation and tracking of the articulators, and this secondary front-

end effect (after the front-end effect in ROI tracking) can have a major effect on AVSP

performance.


Many systems use contour-based features taken directly from geometric measures

based on the tracked points, such as the height, width, perimeter or area of the inner

and/or outer lip contours [136, 109, 142, 5, 159, 87]. In addition to the lips, the presence

or absence of the tongue and teeth [198, 63] can also be used as features, particularly as

many visemes are identical but for such features. Alternatively, the location of the vis-

ible articulators can be modelled and the features can be obtained from the parameters

of these models. Some of these parametric modelling approaches include active shape

models (ASMs) [110, 116, 184], snake based algorithms [28, 51] and deformable lip

templates [196, 22]. The choice of geometric or parametric contour feature extraction

is similar to that of non-parametric versus parametric classifiers introduced in Chap-

ter 3, and it is not yet clear which approach is better suited to contour-based feature

extraction.

5.4.3 Combination

As appearance- and contour-based features are quite different in nature, the properties of the ROI that are represented by each method also differ. Contour-

based feature extraction can be considered to extract high level, and appearance-based

low level, speech related features, and therefore can be considered complementary

sources of information in fusion. A number of combinational approaches have shown

promise for visual speech feature extraction, of which the simplest is basic concatena-

tion of the appearance- and contour-based feature vectors, possibly followed by fur-

ther feature reduction steps. In another approach, rather than collecting the intensity information from the ROI directly, the intensities are collected based on the location of the tracked lip con-

tours, so that regions outside of the lip region are not considered for appearance-based

feature extraction [110, 49, 186]. A parametric combinatorial approach was taken in

active appearance models (AAMs) [126, 30, 116] which combine ASMs and intensity

information into a single model, from which speech features could be extracted.


5.4.4 Choosing a visual feature extraction method

The choice between appearance and contour-based features is not yet an easy one,

despite the large body of accumulated AVSP research using various techniques from

either methodology and combinations of both. Even within the two camps, there is

little consensus on the best feature extraction methods, so comparing the methodolo-

gies is an even more difficult task, with no comprehensive comparison of various ap-

proaches yet completed in the literature. In limited comparisons (both in techniques

tested, and AVSP tasks evaluated), it has been demonstrated that appearance-based

features outperformed contour-based [142, 164], and combination strategies tend to

perform better than appearance-based alone [28, 115]. However, in a large vocabulary

speech recognition application, Matthews et al. [116] found that appearance models

outperformed the combinational AAM approach.

Intuitively, it would seem that ideal contour-based features would provide better performance than appearance-based features, because they are related directly to the movement

of the visible articulators. Appearance-based features also can contain a large amount

of information, such as illumination or variations in speaker appearance, that may

be unimportant to modelling either speech or speakers [162], and that contour based

feature extraction methods are largely invariant towards.

However, the ability of contour based methods to accurately represent the position of

the lips is highly dependent upon the front-end-effect of the lip tracking itself. Even if

the lip tracking can be performed accurately, it does introduce an additional complica-

tion to the visual front-end. Therefore, any potential performance increase of contour

based methods must be considered in trade-off with the extra processing in the video

front-end, and a minimal increase may not be worth the additional complication. By

comparison, appearance-based methods only rely on a coarse localisation of the lips,

making the feature extraction much more stable, especially in poor environmental

conditions.

The case for appearance-based features is furthered in a comprehensive review of re-


cent visual speech recognition systems by Potamianos et al. [145]. They found that

appearance-based features can extract information about all articulators present in the

ROI, whereas contour-based methods had to explicitly track the articulators, and quite often the teeth, tongue and jaw muscles were not considered. This conclusion was also

motivated by perceptual studies that showed human speech perception improved

when the entire ROI could be seen versus movement alone. Potamianos et al. also

found that appearance-based features can be extracted much quicker than contour-

based, as there was no further need for tracking after the ROI was located. As most

real-world implementations must be expected to extract features at the frame-rate of

the video, this point is a particularly cogent one for the choice of appearance-based

feature extraction techniques. Correspondingly, appearance-based feature extraction was chosen as the visual feature-extraction methodology for the experimental work performed in this thesis.

5.5 Dynamic visual speech features

While generally demonstrated to perform as well as or better than contour based

methods, most simple appearance-based methods do tend to contain a significant

amount of information irrelevant to the visual speech events. However, there are a

number of demonstrated techniques that attempt to extract the maximum visual speech information from the ROI, whilst discarding unwanted variance due to other factors. This section will begin with a background on existing methods for extracting dynamic visual speech features, and will then detail the visual feature extraction

technique used for this thesis.

5.5.1 Background

As visual speech is fundamentally represented by the movements of the visual articu-

lators, the best features for representing visual speech are generally considered to be those that focus


on the movement of these features, rather than the stationary appearance within each

frame [16, 108]. While this is clearly the case in speech recognition applications, it is

not completely clear that this would apply for speaker recognition, where the static

features, such as skin colour or facial hair, of the ROI may be useful for identity pur-

poses [113, 25]. A number of researchers have shown that purely dynamic features

can work well for speaker recognition applications [20, 129, 52], although there has not been any significant comparison of dynamic features with existing static features

in the literature. Such a comparison will be conducted in Section 5.6 of this thesis.

The simplest method of attempting to extract dynamic information from the video fea-

tures is through the use of time-derivative-based delta and acceleration coefficients.

These coefficients are generally used in addition to the original static feature val-

ues [146], although some researchers have discarded the static and used only the time-

derivative features [54]. In a similar manner, rather than calculating frame differences

using extracted features, the ROIs can be converted into frame-to-frame difference im-

ages before feature extraction can occur [67].

While time-derivative features, whether calculated before or after normal feature ex-

traction, show the differences between adjacent frames, they do not directly indicate

the movement of the visual articulators. For this purpose features based on calcu-

lating the optical flow [7] within the ROI have been used widely for both speech and

speaker recognition applications in the visual domain [120, 20]. However, it is not

clear that there is any performance increase when compared to time-derivative-based

features [67, 120].

One technique that has recently shown good performance in AVSP applications is

the use of LDA to extract the relevant dynamic speech features from the ROI [126, 145,

123]. To emphasise the dynamic features, the static features of a number of consecutive ROIs are first concatenated before speech-class-based LDA is performed based on a known transcription. This approach will form the basis of the visual feature extraction

of this thesis.


Figure 5.6: Overview of the dynamic visual feature extraction system used for this thesis.

5.5.2 Cascading appearance-based features

The current state-of-the-art in visual speech feature extraction is a multi-stage cas-

cade of appearance-based feature extraction techniques developed by Potamianos

et al. [147]. This approach has been shown to work well for both speech [145] and

speaker [123] recognition. A simplified version of Potamianos et al.’s cascade will

form the basis of the visual speech feature extraction techniques used for the experi-

mental work in this thesis.

An outline of the simplified feature extraction system is shown in Figure 5.6, and can

be seen to have three main stages:

1. Frames are (optionally) first normalised to remove irrelevant information,

2. Static features are extracted for each individual frame, and

3. Dynamic features are calculated from the static features over several frames

Frame normalisation

Before the static features can be extracted from each frame’s ROI, an image normali-

sation step is first performed to remove any irrelevant information, such as illumina-

tion or speaker variances. In Potamianos et al.’s original implementation of the cas-

cade [147], this step was performed using feature normalisation on the static features,

but image normalisation has been shown to work slightly better due to the ability


to handle variations in speaker appearance, illumination and pose as part of a wider

pre-processing front-end [102]. As such, image mean normalisation was chosen over

feature mean normalisation for this thesis.

This image normalisation step consists of calculating the mean ROI image, Ī, over an entire utterance [I_1, I_2, . . . , I_T]:

\bar{I} = \frac{1}{T} \sum_{t=1}^{T} I_t \qquad (5.6)

This mean image can then be subtracted from each ROI image I_t as it is presented to the static feature extraction stage:

I'_t = I_t - \bar{I} \qquad (5.7)

While this approach is not suitable for real-time operation, due to the need to have seen the entire utterance, a suitable alternative could easily be developed if needed through the use of running averages.
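A minimal NumPy sketch of (5.6) and (5.7), operating on the stack of grayscale ROI images for a single utterance:

```python
import numpy as np

def mean_normalise_rois(rois):
    """Subtract the utterance-level mean ROI image, per (5.6) and (5.7).

    rois: array of shape (T, H, W) holding the grayscale ROIs I_1..I_T.
    Returns the mean-removed ROIs I'_t = I_t - mean(I).
    """
    mean_image = rois.mean(axis=0)   # (5.6)
    return rois - mean_image         # (5.7), broadcast over all frames

# Example: 75 frames of 24x16 ROIs.
normalised = mean_normalise_rois(np.random.rand(75, 16, 24))
```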

The motivation behind normalising the ROI in this manner comes from the notion that

a large amount of speaker appearance-based information is collected in the standard

appearance-based feature extraction techniques [29], and that this information would

not be useful for modelling speech events. Of course, it is quite possible that this in-

formation would be useful for the speaker recognition application, so a version of the

cascade will also be tested without this normalisation step in Section 5.6 to investigate

this effect.

Static feature extraction

Once the ROI has been mean-normalised, static visual speech features can then be

extracted. As has been mentioned previously in Section 5.4, the main aim of fea-


ture extraction is to provide compression of the raw pixel values in the ROI whilst

still maintaining good separation of the differing speech events. DCT-based feature

extraction was chosen for Potamianos et al.’s original cascade [147] as well as for the implementation in this thesis. This method of feature extraction was chosen because it had previously been shown to work as well as major alternative feature ex-

traction techniques like PCA, and slightly better than DWT [147]. DCT (and DWT)

also have the added advantage that they do not require the extensive subspace train-

ing needed for PCA-based feature extraction, as these algorithms are not based on any

prior knowledge of the ROIs.

Given an input grayscale ROI of size L × H pixels, the two-dimensional DCT at position x, y can be defined as

F_t(x,y) = \sqrt{\frac{2}{L}} \sqrt{\frac{2}{H}}\, c_x c_y \sum_{i=0}^{L-1} \sum_{j=0}^{H-1} I'_t(i,j) \cos\left[\frac{(2i+1)\, x\pi}{2L}\right] \cos\left[\frac{(2j+1)\, y\pi}{2H}\right] \qquad (5.8)

where I′t (x,y) is the grey-scale value at position x,y of the mean normalised ROI, and

the cx and cy coefficients are defined by

c_a = \begin{cases} \frac{1}{\sqrt{2}} & a = 0 \\ 1 & a \neq 0 \end{cases} \qquad (5.9)

The DCT produces a representation of the original ROI in the frequency domain, with

the DCT calculated in (5.8) resulting in an image of identical dimensions to the original

ROI. As the DCT produces a full representation of the original image, in the frequency

rather than spatial domain, no loss of information occurs and the application of an

inverse-DCT can transform Ft back into the original normalised ROI, I′t flawlessly.

Although a complete DCT transformation provides no compaction of the ROI, the

transform does have a useful property that results in most of the energy residing in

the low-order coefficients.

Figure 5.7: Most of the energy of a 2D-DCT resides in the lower-order coefficients, and can be collected easily using a zig-zag pattern.

Using this characteristic, most of the transformed image, F_t, can be discarded with little loss of information. One of the most common tech-

niques of extracting just the lower-order coefficients is through the use of a zig-zag

scheme, developed for use in the JPEG image compression scheme [183]. The zig-zag

scheme, shown in Figure 5.7, is designed to keep DCT coefficients of similar frequen-

cies together, and by choosing the first D^S coefficients of the DCT images through this

scheme, a compact representation of the original ROI can be realised:

o^S_t = \left[ F_t\left(z_x(1), z_y(1)\right), \ldots, F_t\left(z_x(D^S), z_y(D^S)\right) \right] \qquad (5.10)

where z_x(d) and z_y(d) are defined based on the zig-zag illustrated in Figure 5.7:

\begin{bmatrix} z_x \\ z_y \end{bmatrix} = \begin{bmatrix} 1 & 2 & 1 & 1 & \cdots \\ 1 & 1 & 2 & 3 & \cdots \end{bmatrix} \qquad (5.11)

Determining the number of coefficients (D^S) that should be extracted from the DCT images consists of a trade-off between the amount of information available and the complexity required to train classifiers in higher dimensional spaces. For this thesis, D^S was chosen as 20 based on empirical experiments. For evaluation of the static

features, delta and acceleration components were added to result in a 60 dimensional

feature space, but only the primary 20 features were used as input to the dynamic

feature extraction stage.
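The sketch below illustrates this static feature stage using SciPy's orthonormal 2D DCT-II, whose normalisation matches (5.8), together with a zig-zag ordering that reproduces the pattern of (5.11); D^S = 20 follows the choice above, while the helper names themselves are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(h, w):
    """Return (x, y) coordinates in JPEG-style zig-zag order (Figure 5.7)."""
    return sorted(((x, y) for y in range(h) for x in range(w)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 else p[0]))

def static_dct_features(roi, n_coeffs=20):
    """2D-DCT of a mean-normalised ROI, per (5.8), keeping the first
    n_coeffs coefficients in zig-zag order, per (5.10)."""
    # Orthonormal 2D DCT-II, applied along rows then columns.
    F = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    coords = zigzag_indices(*roi.shape)[:n_coeffs]
    return np.array([F[y, x] for x, y in coords])

# Example: a 16x24 mean-removed ROI -> a 20-dimensional static vector o^S_t.
print(static_dct_features(np.random.rand(16, 24)).shape)
```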

In Potamianos et al.’s original implementation an application of speech-event based

LDA was performed upon the DCT features before the dynamic feature extraction


stage, but this stage was not applied for this thesis, as it added complication and was

found to not provide significant benefit over just applying the speech based LDA dur-

ing dynamic feature extraction.

Dynamic feature extraction

To extract the dynamic visual features that have been shown to improve human per-

ception of speech, this stage of the cascade extracts LDA-based features over a range

of consecutive ROIs. An overview of this stage is shown as the bottom row of Fig-

ure 5.6, and is identical to the dynamic stage of Potamianos et al.’s [147] cascade. To

allow the LDA to be used to extract the dynamic, rather than static, features of the

visual speech, ±J consecutive frames were concatenated around the frame under consideration before the LDA transformation matrix could be calculated. For this thesis, J = 3 was found to provide the best balance between the amount of information captured and the size of the resulting LDA transformation matrix [102]. The input to the LDA algorithm for the concatenated ROI features around o^S_t is given as

o^C_t = \left[ o^S_{t-J}, \ldots, o^S_t, \ldots, o^S_{t+J} \right] \qquad (5.12)

It can be seen that this results in a feature vector of size D^C = (2J + 1) D^S, where D^S is

the dimensionality of the original static vector.

The aim of the LDA algorithm is to arrive at a suitable transformation matrix W^D_LDA that provides the best separation over a range of known classes. LDA training is based upon a set of N training examples X^C = {o^C_1, . . . , o^C_N} and a set of matching class labels L = {l_1, . . . , l_N}, where each l_n is between 1 and the number of possible

classes A. For the implementation of the cascade used for this thesis, the class labels

are calculated by force-aligning word models trained on the PLP-based acoustic fea-

tures against the known transcription of the training sequences. That is, each video

observation is labelled with the word (including the ‘silence’ meta-word) and the state it appears within, based on the acoustic models.

The LDA transformation matrix is calculated such that the within-class dispersion is

minimised, while the between-class distance is maximised. To allow these goals to be

met, the within-class scatter matrix Sw and the between-class scatter matrix Sb are

defined based on the statistics of the training data around their own class and around

the whole distribution respectively. The within class matrix is defined by

S_w = \sum_{a=1}^{A} P(a)\, \Sigma_a \qquad (5.13)

where P (a) is the likelihood of class a occurring based on the labelled training data,

and Σa the covariance matrix of the ath class. The between-class scatter matrix is then

defined as

S_b = \sum_{a=1}^{A} P(a) \left(\mu_a - \mu_0\right)\left(\mu_a - \mu_0\right)^T \qquad (5.14)

where µa is the mean of the ath class, and µ0 is the global mean over all classes given

by

\mu_0 = \sum_{a=1}^{A} P(a)\, \mu_a \qquad (5.15)

The transformation matrix can then be found by maximising [126]

Q\left(W^D_{LDA}\right) = \frac{\left| W^D_{LDA}\, S_b \left(W^D_{LDA}\right)^T \right|}{\left| W^D_{LDA}\, S_w \left(W^D_{LDA}\right)^T \right|} \qquad (5.16)

where |x| denotes the determinant of matrix x. In a similar manner to PCA, (5.16) can be solved by calculating the eigenvalues and eigenvectors of the matrix pair (S_b, S_w) such that S_b F = S_w F D, where F contains the eigenvectors as its columns, F = [f_1, . . . , f_{D^C}], and the D^C highest eigenvalues form the diagonals of D. The LDA transformation


matrix can then be defined simply as the transpose of the eigenvector matrix, and the

eigenvalues are discarded:

W^D_{LDA} = F^T \qquad (5.17)
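As an illustrative sketch of (5.13)–(5.17), the following NumPy/SciPy code estimates W^D_LDA from a matrix of concatenated training vectors and their class labels by solving the generalised eigenproblem S_b F = S_w F D directly; regularisation and other numerical safeguards that a practical implementation would need are omitted, and the function name is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels):
    """Estimate the LDA transformation matrix per (5.13)-(5.17).

    X: (N, D_C) matrix of concatenated feature vectors o^C_n.
    labels: length-N array of class labels.
    Returns W_LDA with eigenvectors (most discriminative first) as rows.
    """
    classes = np.unique(labels)
    mu0 = np.zeros(X.shape[1])
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)

    # Class priors, means, within-class scatter and global mean: (5.13), (5.15).
    priors, means = {}, {}
    for a in classes:
        Xa = X[labels == a]
        priors[a] = len(Xa) / len(X)
        means[a] = Xa.mean(axis=0)
        Sw += priors[a] * np.cov(Xa, rowvar=False)
        mu0 += priors[a] * means[a]

    # Between-class scatter, per (5.14).
    for a in classes:
        diff = (means[a] - mu0)[:, None]
        Sb += priors[a] * (diff @ diff.T)

    # Generalised eigenproblem S_b f = lambda S_w f; eigh returns ascending
    # eigenvalues, so reverse to put the most discriminative directions first.
    eigvals, F = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    return F[:, order].T   # W^D_LDA = F^T, per (5.17)
```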

As the dimensionality of the LDA transformation matrix is identical to that of the data being transformed, LDA does not perform well on high dimensional data such as raw images [8], due to the computational difficulty in calculating such a large matrix. It is for this reason that the LDA stage of the cascade is based upon the DCT transformation of the normalised ROIs calculated earlier in the cascade. Even with the concatenation of the ±J adjacent frames, D^C is still much smaller than the number of pixels in the normalised ROI region.

The set of training examples, X^C, is taken from the training sessions of the speech processing framework developed in Chapter 4. Ideally, these training observations

would be separate from the set of observations used to train and test the speech mod-

els. However, this was not able to be achieved due to the limited data available in the

XM2VTS database. Although the training sequences for the LDA transformation and

the word models are the same, neither of these operations has any knowledge of the testing sequences, which should allow for a valid evaluation of their performance in testing. Because the output of this cascade is based on training examples of the mean-removed DCT static features, an additional complication is introduced: a differing

set of cascading appearance-based features must be formed for each unique training

configuration under the speech processing framework.

Once the LDA transformation matrix W^D_LDA has been calculated using the training sequences, it can then be used to transform the static observation vectors from the DCT stage of the cascade to form the dynamic visual speech features used to train and test the models for speech and speaker recognition. Before applying the transformation matrix to the concatenated static features, the dimensionality of the output dynamic features can be limited by choosing only the first D^D eigenvectors in W^D_LDA to arrive at W'^D_LDA. Given a concatenated mean-removed DCT feature vector, o^C_t, the final dynamic speech feature vector, o^D_t, can then be calculated by matrix multiplication:

o^D_t = W'^D_{LDA}\, o^C_t \qquad (5.18)

This dynamic visual speech vector is the final stage of the cascading appearance-based

feature extraction technique, and will be the basis of the visual features used through-

out the remainder of this thesis.
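Tying the stages together, a small sketch (reusing the hypothetical lda_transform() helper above) of how the dynamic feature vectors o^D_t might be formed at run time: the ±J surrounding static vectors are concatenated per (5.12) and projected with the truncated LDA matrix per (5.18); the padding at utterance boundaries is again an assumption.

```python
import numpy as np

def dynamic_features(static, W_lda, J=3, n_out=None):
    """Form dynamic visual features o^D_t per (5.12) and (5.18).

    static: (T, D_S) matrix of static DCT features o^S_t for one utterance.
    W_lda:  LDA matrix with eigenvector rows, e.g. from lda_transform().
    """
    W = W_lda if n_out is None else W_lda[:n_out]   # truncated W'^D_LDA
    padded = np.pad(static, ((J, J), (0, 0)), mode='edge')
    # Concatenate frames t-J..t+J for every frame t, per (5.12).
    o_c = np.hstack([padded[j:j + len(static)] for j in range(2 * J + 1)])
    return o_c @ W.T                                # (5.18), for all frames t
```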

5.6 Comparing speech and speaker recognition

It is clear from existing research that the dynamic information contained in visual speech is of primary importance for the task of visual speech recognition [126, 145].

However, while dynamic information has been shown to perform well for the task of

visual speaker recognition [123, 20], it is not clear that actively removing static infor-

mation, which can be a useful pre-processing stage for speech, is a sensible decision

for the speaker recognition application. To this end, this section will look at visual

features extracted from the static and dynamic stages of the cascade outlined in Sec-

tion 5.5 for both the speech and speaker recognition tasks. Additionally, the cascade

will also be tested without the image normalisation stage to determine whether the

normalisation, and resulting removal of static speaker information, has any effect on

the final speaker recognition performance.

The training and testing of the models used for speech and speaker recognition in this

section will be based on the framework developed in Chapter 4 on all 12 configura-

tions of the XM2VTS database. The performance of the various stages of the visual

feature extraction cascade will first be presented for the speech recognition task to

confirm the speech recognition ability of the cascade-derived features on the XM2VTS database against Potamianos et al.’s earlier work [147]. Potamianos et al.’s work on this cascade has primarily been tested on proprietary databases, and an evaluation of these features on the publicly available XM2VTS database should provide a good

baseline for future visual speech research.

The same set of features will then be evaluated for the speaker verification task to

determine the utility of the cascading appearance-based features for speaker recog-

nition applications. Such features have been used for speaker recognition only once in the literature [125], and were not studied in detail at the time. This chapter

aims to rectify this situation, and provide a detailed comparison of dynamic and static

video features for speaker recognition, based on both image-normalised and raw DCT

static features. Finally conclusions will be drawn from both the speech and speaker

recognition performance as to the dynamic nature of visual speech.

5.6.1 Feature extraction

While the main focus of this section will be on the suitability of dynamic visual speech

features for speech and speaker recognition, acoustic features will also be evaluated

to serve as a baseline. For these experiments the acoustic features will be PLP- and

MFCC-based features extracted from the raw acoustic signal every 10 milliseconds

over 25 millisecond Hamming windows. Both acoustic features were based on the

first 12 Mel-frequency banks, and added to an energy coefficient to result in 13 static

features. Delta and acceleration features were then appended to arrive at an 39-

dimension acoustic feature vector for each window. These feature will be referred

to as the A-PLP and A-MFCC features throughout these experiments.

Video features for these experiments were gathered from both the static and dynamic

stages of the appearance-based cascade described earlier, extracted from the manually

tracked ROIs.

Two versions of the static features from the cascade will be evaluated in this sec-

tion, one based on the image normalised grayscale ROIs, and one based on the un-

normalised grayscale ROIs. 20-dimensional DCT-based static feature extraction is

performed on these ROIs as described in Section 5.5.2, and deltas and acceleration

coefficients are appended to arrive at a 60 dimensional visual feature vector for each

video frame. The dimensionalities chosen here were based on tuning experiments

and experiments performed by Lucey [102]. The image-mean-normalised DCT and

un-normalised DCT features will be referred to as V-MRDCT and V-DCT features

throughout these experiments.
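The following sketch illustrates what the image-mean-normalised static DCT stage might look like, assuming scipy is available; the coefficient selection shown here is a simple low-order block rather than the exact ordering used in the cascade, and the helper name is hypothetical.

```python
# Minimal sketch (illustrative only): subtract the utterance-mean ROI, take a
# 2D DCT of each frame and keep 20 low-order coefficients as the static
# V-MRDCT-style feature vector.
import numpy as np
from scipy.fft import dctn

def mrdct_static_features(roi_frames, n_coeffs=20):
    """roi_frames: array of shape (num_frames, height, width), grayscale."""
    mean_image = roi_frames.mean(axis=0)          # removed for V-MRDCT;
    feats = []                                    # skip this step for V-DCT
    for roi in roi_frames:
        coeffs = dctn(roi - mean_image, norm='ortho')
        feats.append(coeffs[:5, :5].ravel()[:n_coeffs])
    return np.asarray(feats)                      # shape: (num_frames, 20)
```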

The two static features vectors are then used as the basis of the dynamic features ex-

traction. The 7 static feature vectors (not including deltas and accelerations) surround-

ing and including each video frame were concatenated. This concatenated feature

vector then underwent LDA-based feature reduction based on speech events deter-

mined by force aligning the A-PLP features with a known transcription. The resulting

image normalised and non-normalised dynamic features will be referred to as V-LDA-

MRDCT and V-LDA-DCT respectively throughout these experiments.
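A minimal sketch of this stacking-plus-LDA step is shown below, assuming scikit-learn is available; the number of retained LDA dimensions and the speech-event labels are placeholders.

```python
# Minimal sketch (illustrative only): concatenate the 7 static vectors centred
# on each frame, then learn an LDA projection from speech-event labels obtained
# by forced alignment. out_dim must be smaller than the number of classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(static, context=3):
    """static: (num_frames, dim) -> (num_frames, dim * (2*context + 1))."""
    padded = np.pad(static, ((context, context), (0, 0)), mode='edge')
    n = len(static)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def train_dynamic_lda(static, event_labels, out_dim=40):
    stacked = stack_frames(static, context=3)     # 7-frame concatenation
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(stacked, event_labels)                # labels from forced alignment
    return lda, lda.transform(stacked)            # projector and features
```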

The A-PLP, A-MFCC, V-DCT and V-MRDCT features were designed such that they

could be extracted from any given utterance without any prior knowledge of the type

of data they were working with, allowing their feature vectors to be used for each

of the configurations of the XM2VTS database. However, because the LDA-derived

features, V-LDA-MRDCT and V-LDA-DCT were trained based on acoustic speech

events in the training sessions of the framework, each unique training configuration

of the framework had to use a differing set of LDA-derived visual feature vectors.

As a result, each sequence being tested had 6 different feature representations for

V-LDA-MRDCT and V-LDA-DCT based upon which framework configuration was

being tested.

5.6.2 Model training and tuning

To evaluate the speech and speaker recognition performance two different model

types had to be trained for each of the 6 training configurations of the AVSP frame-

Datatype        States   Mixtures
A-MFCC          11       8
A-PLP           11       8
V-DCT           9        16
V-LDA-DCT       9        16
V-MRDCT         9        16
V-LDA-MRDCT     9        16

Table 5.1: HMM topologies used for the uni-modal speech processing experiments.

work. These models are as follows:

1. background word models

2. speaker word models

Under the AVSP framework developed in Chapter 4, these two models would allow

both speaker-independent and speaker-dependent continuous speech recognition, as

well as performing text-dependent speaker verification.

For this particular implementation of the framework, the word models were imple-

mented as left-to-right HMMs as described in Chapter 3. The topologies of the mod-

els were tuned by evaluating the speech and speaker recognition performance on

a single training configuration of the XM2VTS database, and these topologies were

kept for all remaining configurations. Table 5.1 shows the tuned topologies for each

datatype tested in these experiments. In the process of tuning the HMM topologies

it was discovered that the best performing topologies for speech and speaker recog-

nition tended to be very similar. This was fruitful as it allowed the same models to

be trained and then tested on both tasks, as intended for the AVSP framework, rather

than having to train a separate set of models for speech and speaker recognition.
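For illustration, a left-to-right word HMM of the kind tuned here corresponds to a banded transition matrix such as the following sketch; the initial self-loop probability is an arbitrary placeholder rather than a trained value.

```python
# Schematic sketch of a left-to-right HMM transition matrix: each state may
# only self-loop or move to the next state. Initial probabilities here are
# placeholders that would be re-estimated during training.
import numpy as np

def left_to_right_transitions(n_states, stay_prob=0.6):
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay_prob            # remain in the current state
        A[i, i + 1] = 1.0 - stay_prob  # advance to the next state
    A[-1, -1] = 1.0                    # final emitting state
    return A

A_acoustic = left_to_right_transitions(11)  # e.g. the 11-state A-PLP topology
```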

An additional parameter that required tuning was the MAP-adaptation factor, or τ,

for adapting the speaker word and text-independent speech models from the equiv-

alent background models. This factor controls the relative importance of the existing

                WER (%)
Datatype        SI        SD
A-MFCC          4.65      2.72
A-PLP           4.05      1.06
V-DCT           52.77     18.84
V-LDA-DCT       33.88     10.24
V-MRDCT         39.01     15.15
V-LDA-MRDCT     27.90     8.22

Table 5.2: WERs for speech recognition on all 12 configurations of the XM2VTS database.

background models as compared to the data being adapted towards. Tuning of this

parameter was performed in a similar manner to the topologies, but only for the task

of speaker verification. A MAP-adaptation factor of τ = 0.75 was found to perform

well over all datatypes and was therefore chosen for these experiments.
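As a point of reference, the sketch below shows the standard relevance-factor form of the MAP mean update that such a τ-weighted adaptation is assumed to follow; it is a schematic of the general technique, not the exact update used here.

```python
# Schematic sketch of a MAP mean update with relevance factor tau: the adapted
# mean interpolates between the background (prior) mean and the adaptation
# data's posterior-weighted mean. Not the exact update used in the thesis.
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=0.75):
    """prior_mean: (dim,); frames: (T, dim); posteriors: (T,) occupation
    probabilities of this mixture component on the adaptation data."""
    occ = posteriors.sum()                                   # soft frame count
    data_mean = (posteriors[:, None] * frames).sum(axis=0) / max(occ, 1e-10)
    return (tau * prior_mean + occ * data_mean) / (tau + occ)
```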

5.7 Speech recognition experiments

5.7.1 Results

Speech recognition experiments were performed using both the background word

models and the speaker specific models. Results were reported using word error rates

(WER) calculated for each configuration of the database. The relative performance of

each datatype was found to be similar between differing database configurations, and

so each datatype’s WER was reported as the average over all 12 configurations. The

SI and SD average WERs are shown in Table 5.2.
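For reference, the WERs reported here follow the usual edit-distance definition (substitutions, deletions and insertions counted against the number of reference words), as in the minimal sketch below.

```python
# Minimal sketch of the standard WER calculation via Levenshtein alignment of
# reference and hypothesis word sequences.
def word_error_rate(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words into the
    # first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(word_error_rate("three two one zero", "three too one zero"))  # 25.0
```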

5.7.2 Discussion

The SI speech recognition results reported in Table 5.2 confirm the relative improve-

ments shown by Potamianos et al.’s [147] original cascade, but SD speech recognition

results against such a cascade have not yet been published in the literature. This sec-

tion will discuss these results in some detail, focusing on the effect of the normalisa-

tion and LDA stages of the cascade on the speech recognition WERs. The difference

in speech recognition performance will then be discussed in relation to the speaker-

dependency (or not) of the speech models, followed by a quick examination of the

acoustic results.

Mean-image normalisation

By comparing the V-DCT and V-MRDCT speech recognition WERs in Table 5.2, it can

be seen that by removing the mean image from each utterance, the V-MRDCT features

reduce the WER by 13.76% in the SI case. That such a large improvement in the WER

can come by simply removing the mean image suggests that the speaker and environ-

ment specific information contained in the mean image are hindering the SI speech

recognition performance due to variations in this information between speakers and

and even between sessions with specific speakers.

Although the speaker variations should not be a significant problem with SD speech

recognition (as each model is adapted to the target speaker), the V-MRDCT still pro-

vided a 3.69% improvement in the WER over the V-DCT features. This improvement is

likely to be related to normalising other environmental variations in the utterances be-

tween the differing training and testing sessions of the database configurations. Some

examples of such variations may be changes in lighting or positioning of the speaker,

or if the facial hair, glasses or makeup of the speaker has changed between the training

and testing sessions.

One additional factor that may relate to the improvement of the V-MRDCT SD speech

recognition performance is the underlying improvement in the SI background model

before the adaptation to the SD models.

Dynamic feature extraction

The application of speech-event-based LDA for the V-LDA-DCT and V-LDA-MRDCT

features in the feature extraction cascade provides further improvements in the speech

recognition WER over the underlying V-DCT and V-MRDCT datatype.

Referring back to Table 5.2, it can be seen that for SI speech recognition, the un-

normalised V-LDA-DCT features provide an 18.89% decrease in the WER over the

original V-DCT features. However, the best performance comes from the application

of speech-event-based LDA to the already normalised V-MRDCT features providing

a further 11.11% WER decrease, resulting in the best SI visual speech recognition per-

formance.

The LDA algorithm was designed to choose features that best discriminate between

a set of classes, which for these experiments were the transcribed words models and

their states. For this reason, it can be seen as a more intelligent form of removing

irrelevant features than was attempted using mean image normalisation earlier in the

cascade, as features of the ROI that do not vary with the speech are unlikely to be

included in the LDA-transformed features. The small improvement in the V-LDA-

DCT over the V-MRDCT features shows that the discriminative nature of the LDA

feature extraction can outperform the more brute force removal of the mean image.

Of course, there is no reason not to perform both the normalisation and discriminative

stages of the cascade, resulting in the best-performing V-LDA-MRDCT features.

A similar, but smaller, performance increase is obtained for the LDA stage of the cascade for the SD speech recognition results, with a 6.93% improvement in

the image normalised results and 8.6% for the un-normalised. In a similar manner to

the SD performance increase of mean-image-removal, this improvement is likely to be

related to variations in environmental and speaker’s appearances between the training

and testing sessions, as well as the possibility of the improvement in the SI background

models being passed through to the speaker adapted speech models. Similar to SI

speech, the best performing SD speech recognition performance is provided by the

final stage of the cascade, with the WER of 8.22% starting to approach that expected of

acoustic features, suggesting that in controlled environments such features could be

usable without the need for any audio at all.

Speaker dependency

For all of the datatypes tested in Table 5.2, it can be seen that the speech recognition

WER is reduced by at least half when SD speech models are used. The poorer performance in the SI case is clearly related to the variation in speakers between

the training and testing sets that is not a factor for SD speech recognition, where each

speech model is adapted directly to the target speaker.

While the acoustic WERs are only separated by a few percent between the SI and

SD cases, the best performing V-LDA-MRDCT features have a much wider gulf be-

tween the two possible speech recognition configurations. Even though the entire

appearance-based cascade provided a major improvement in speech recognition per-

formance over the raw V-DCT features, there is still a large amount of speaker-specific information, as seen by the 19.68% decrease in WER gained by using SD models. This

suggests that there is still much room for improvement in the development of video

features for SI video speech recognition, and any such improvements are likely to

move video speech recognition towards where it can reasonably be used uni-modally

in controlled conditions.

Acoustic performance

It can be seen that both the A-PLP and A-MFCC features work well for speech recogni-

tion, which is reflected by their widespread use in mature acoustic speech processing

research [157, 151]. While both features appear to work equally well when the training

and testing speakers are mismatched in SI speech, the A-PLP features appear to adapt

better to the individual speaker word models, resulting in an improved SD WER when

compared to the A-MFCC results.

5.8 Speaker verification experiments†

5.8.1 Results

Speaker verification experiments were performed on the same set of features as the

speech recognition experiments in the previous section. Speaker verification scores

were calculated by comparing scores obtained with the speaker specific models and

the background models and plotting the difference between the two using DET plots

to investigate the relative false alarm rate and misses that can be obtained with each

datatype under consideration. In a similar manner to the speech recognition exper-

iments, and to ensure that enough scores are available to accurately evaluate the

performance of each feature-extraction method, all 12 configurations of the XM2VTS

database were evaluated for speaker verification.
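A minimal sketch of how these verification scores and the resulting miss and false alarm rates could be computed from per-utterance log-likelihoods is given below; the decoder producing the log-likelihoods and the arrays of target and impostor scores are assumed inputs.

```python
# Minimal sketch: the verification score is the difference between the claimed
# speaker's and the background models' log-likelihoods, and miss / false-alarm
# rates follow from sweeping a threshold over target and impostor scores.
import numpy as np

def llr_score(ll_speaker, ll_background):
    # log p(O | speaker models) - log p(O | background models)
    return ll_speaker - ll_background

def miss_and_false_alarm(target_scores, impostor_scores, threshold):
    miss = np.mean(np.asarray(target_scores) < threshold)            # targets rejected
    false_alarm = np.mean(np.asarray(impostor_scores) >= threshold)  # impostors accepted
    return miss, false_alarm
```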

The results of the text-dependent speaker verification experiments using the speaker

dependent and background word HMMs are shown in Figure 5.8. It can be seen that

all of the visual features are providing speaker verification performance in a similar

range to the acoustic features, which is quite different than for speech recognition,

where both audio feature extraction methods were clearly better than any of the visual

features. The ability of the A-PLP features to better represent individual speakers

when compared to A-MFCC features is also clearly visible here.

5.8.2 Discussion

Figure 5.8 shows that even though the original intention of the appearance-based

cascade was to remove speaker-specific information and emphasise the speech-event-specific information, all stages of the cascade actually perform better for visual speaker verification

than the V-DCT features, which theoretically contain the most static speaker-specific information. Even in the relative improvements between the various video datatypes, the improvements in video speaker verification performance appear to match those in video speech recognition, suggesting that (of the features tested here) the best video features for both speech and speaker recognition are the same.

Figure 5.8: Text-dependent speaker verification performance on all 12 configurations of the XM2VTS database. (DET plot: miss probability against false alarm probability, both in %, for the A-MFCC, A-PLP, V-DCT, V-LDA-DCT, V-MRDCT and V-LDA-MRDCT features.)

These results suggest that for visual speaker recognition, the (dynamic) behavioural

nature of speech could be more important than the (relatively static) physiological

characteristics [14]. That is, it may be easier to recognise speakers by how they speak,

than by their appearance while they speak. This also has the benefit that because static

appearance is less important, environmental conditions such as illumination, and within-speaker variations such as facial hair or makeup, become less of an issue provided

they do not change throughout an utterance, and as long as the extraction of dynamic

features can still be performed adequately.

5.9 Speech and speaker discussion

Interestingly, these experiments show that the same features that have been shown

to perform well for the task of speaker-independent speech recognition by other re-

searchers [147, 126] also perform well for speaker-dependent speech recognition and

speaker recognition. All of the stages of the cascade that provided benefits by remov-

ing speaker- and session-specific information also provided similar benefits for the

speaker verification experiments, even though the original intent of the cascade was

to provide a form of normalisation across speakers and subsequently improve speech

recognition in unknown speakers.

However, as demonstrated by the speaker-dependent speech recognition results re-

ported earlier, this normalisation effect of the cascade still had benefits even when the

speech was tested on the same speaker as the training, in part due to the normali-

sation of within-speaker session variability such as illumination, facial hair, makeup

and other factors. Of course, it stands to reason that if individual speakers' word models can provide a significant improvement in speech recognition over the background speaker-independent models, then comparing a number of individual subjects' word models against a given utterance can allow conclusions to be drawn as to the identity of

the speaker.

That the speaker verification experiments improved in performance as static informa-

tion was removed suggests that dynamic visual information can play a very important

role in visual (and audio-visual) person recognition, particularly when the facial move-

ments are speech related.

Of course, face recognition is a very mature area of research that has shown that static

recognition of faces can provide good performance, and the possibility certainly ex-

ists of using a combination of static face and dynamic features to represent the visual

modality with a minimum loss of information. Some promising versions of such sys-

tems have been developed [123], but this area is still a relatively new area of research.

5.10 Chapter summary

This chapter has covered the broad fields of acoustic and visual feature extraction for

AVSP. The first half of this chapter covered a review of the state of the art in visual

feature extraction for AVSP. A brief overview of visual front ends for localisation and

tracking of lip ROIs was provided, followed by the manual tracking approach that

was chosen for this thesis to attempt to avoid the front-end effect. A more detailed

review of visual feature extraction techniques was conducted covering appearance

and contour based extraction as well as combinations of the two. A review covering

the extraction of dynamic visual features to better model the nature of visual speech

was then conducted, with particular focus on the cascading approach first suggested

by Potamianos et al. [147] for continuous speaker-independent speech recognition.

The final half of the chapter was devoted to experimental evaluation of visual features

extraction from the various stages of the dynamic appearance-based cascade for both

the speech recognition and speaker verification tasks according to the framework de-

veloped in Chapter 4. These results confirmed the results found by Potamianos et

al. and other researchers [126] for speaker-independent speech recognition, but also

showed good performance as the cascade progressed for speaker-dependent speech

recognition and both text-dependent and independent speaker verification. These ex-

periments showed that even though the cascade was intended to remove speaker (and

session) specific information to improve speaker-independent speech recognition, the

dynamic information extraction works very well for the recognition of speakers as

well, suggesting that visual speech could be considered more of a behavioural than a physiological characteristic for the purposes of recognising speaking persons.

Chapter 6

Simple Integration Strategies

6.1 Introduction

This chapter will present two simple integration strategies that can be performed us-

ing the existing classifier methods and techniques developed in Chapter 3. As these

strategies do not modify the existing classifier techniques, these techniques focus on:

1. Fusing the speech features before classification, or

2. Fusing the output scores after classification

These two techniques will be referred to as early and late integration respectively

throughout this thesis. Both of these techniques have been used extensively in the

AVSP literature and a brief review will be conducted in the beginning of this chapter

to illustrate and compare both techniques for audio-visual speech and speaker recog-

nition.

In the final half of this chapter, audio-visual speech and speaker recognition experi-

ments will be conducted using these simple integration strategies to serve as a com-

parative baseline for the novel SHMM experiments conducted later in the thesis.

6.2 Integration strategies

The study of audio-visual fusion for speech processing is a subset of the broader field

of research referred to as sensor fusion [32]. Sensor fusion covers any research in-

volved in extracting information from multi-sensor environments through some form

of integration of the multi-sensor data. Since the earliest research into sensor fusion

in the early 1980s [174], this area has been adapted for a wide range of applications,

of which one of the more popular has been the improvement in recognition of hu-

mans and their activities. The most obvious such application would be the identifi-

cation of people using multiple biometrics such as face, fingerprints or voices [161],

but many other such applications can also benefit from multiple sensor fusion, includ-

ing person tracking [75], expression recognition [197] and, of course, speech process-

ing [126, 23, 25].

One important method of characterising methods of sensor fusion is based on where

the integration of the information obtained from multiple sensors occurs, of which the

main levels can be defined as

• Early integration, where the raw sensor data or features extracted from this data

are combined before classification,

• Middle integration, covering classifiers inherently designed to handle data or

features from multiple modalities, or

• Late integration, where scores or decisions of individual classifiers for each sen-

sor are combined.

The choice of the level of integration often comes down to the type of sensors used for

a particular application. In particular, late integration is more popular in applications

where the sensors are capturing completely independent data such as the recognition

of a person by their signature and face [161]. Early and middle integration can be

more useful in situations where the sensors are capturing similar information, an ex-

ample of which might be combining visible and infrared images of a face to improve

face recognition in adverse environments [92]. Middle integration largely serves as a

‘catch-all’ category for approaches that do not fit the other two approaches, but many

such systems typically involve one or more of the sensors controlling the integration

of all the sensors.

For audio-visual speech processing, all three levels of fusion have been considered for

both of the speech and speaker recognition tasks, and as such, all three integration

strategies will be considered in this thesis. This chapter will investigate the early and

late integration strategies, as they can be implemented simply using the same classifier

design as the individual modelling experiments conducted in Chapter 5. Based on

the results reported that chapter, only A-PLP and V-LDA-MRDCT features will be

considered for fusion in this thesis due to their superior speech and speakermodelling

ability.

As the middle-integration-based SHMM approach is a primary focus of this thesis,

this chapter will only focus on early and late integration, and the middle integration

approach will be deferred until Chapters 7 and 8.

6.3 Early integration

6.3.1 Introduction

As the audio and visual modalities are combined before classification, early integra-

tion is one of the simplest methods of fusion available for AVSP research, and many

audio-visual speech [1, 173, 145] and speaker [27, 54, 193] recognition systems have

used this approach for this very reason.

Although early integration can come about through the fusion of either the raw sensor

data or from features extracted from the raw data, only the feature-fusion approach has

been demonstrated as feasible in the literature, primarily due to the large data volumes

and registration difficulties involved in combining raw audio and video data [25].

The simplest method of feature fusion is through direct concatenation of the acoustic

and visual features vectors resulting in a single multimodal feature vector [114, 27,

141]. Because the audio and video features are normally captured at differing frame-

rates, some form of interpolation or oversampling is generally used to improve the

video feature rate to that of the acoustic data.

Because a simple concatenation can result in a much larger feature vector than is typ-

ically encountered in single-modality classifiers, some form of feature reduction can

be performed on the concatenated feature vector to reduce the overall size and min-

imise the effect of the ‘curse of dimensionality’. Common feature reduction meth-

ods for this purpose include PCA and LDA [27], with the hierarchical LDA approach

adopted by Potamianos et al. [143] showing particular promise in this area.

Another approach that can be considered a form of early fusion is using visual in-

formation to enhance acoustic features for use in regular acoustic classifiers [60, 146].

By estimating a linear transformation from either the video features alone or a con-

catenation of both modalities, this approach allows a simulated or enhanced acoustic

feature vector to be presented to a regular acoustic speech processing system, allowing

integration of video data into an existing acoustic system with minor modifications.

Twomain forms of early integration speech processing experiments will be conducted

in this thesis, plain concatenative feature-fusion and discriminative feature-fusion.

These experiments will be conducted both to investigate the effectiveness of feature

fusion within the speech processing framework developed earlier, and as a baseline

for the late and middle integration experiments conducted later in this thesis.

6.3.2 Concatenative feature fusion

The concatenative feature fusion vectors used under the speech processing framework

were calculated by concatenating each acoustic feature vector with a corresponding vi-

sual feature vector, resulting in the concatenative feature-fusion vector. As the original

datatypes consisted of 39 acoustic and 60 video features, a large feature vector of 99

elements was generated for each original acoustic feature extraction window.

As the video features were not extracted at the same rate as the acoustic features, the

corresponding video feature vector for each acoustic window was chosen as the closest

(in time) video feature vector. Each video vector was therefore copied completely and

appended to approximately four acoustic feature vectors, with no estimated interpola-

tion occurring between the video frames. This approach was chosen as physiological

experiments have shown that there is little value in using visual features at frame rates

above 15-20 Hz [56].
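A minimal sketch of this nearest-frame concatenation is shown below; the timestamp arrays are assumed inputs and, as described above, no interpolation is performed between video frames.

```python
# Minimal sketch of concatenative feature fusion: each acoustic frame is paired
# with the closest-in-time video frame (copied, not interpolated), giving
# 99-dimensional fused vectors for the 39 + 60 dimensional inputs.
import numpy as np

def concatenative_fusion(audio_feats, audio_times, video_feats, video_times):
    video_times = np.asarray(video_times)
    fused = []
    for a_vec, t in zip(audio_feats, audio_times):
        nearest = int(np.argmin(np.abs(video_times - t)))
        fused.append(np.concatenate([a_vec, video_feats[nearest]]))
    return np.asarray(fused)    # shape: (num_audio_frames, 99)
```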

The resulting 99-dimensional feature vectors were then used to train the full variety

of HMM models defined under the speech processing framework. Because feature-

fusion derived feature vectors can be used identically to individual-modality features,

the full range of speech processing experiments could then be conducted. The HMM

topologies for the concatenative feature-fusion features were chosen to be 11 states and 16 mixtures, as these performed well in empirical tuning experiments and provide a good baseline for comparison with the

similarly configured uni-modal HMMs.

6.3.3 Discriminative feature fusion

One of the major problems with simple concatenation is the large feature vectors that

result from such an approach. As this approach can lead to feature vectors signifi-

cantly larger than the feature vectors of either individual modality, the spec-

tre of the ‘curse of dimensionality’ can rise again. However, similar methods used to

Figure 6.1: Overview of the feature fusion systems used for this thesis, covering both concatenative and discriminative feature fusion.

reduce the size of uni-modal feature vectors can also be applied to reduce the size

of the concatenated feature fusion vector.

The implementation of such an approach for this thesis is shown in Figure 6.1, and

is modelled after Potamianos et al.’s hierarchical LDA approach [143]. As the feature

reduction is performed by LDA, such a system provides the additional benefit over

feature reduction using PCA or DCT of choosing features based upon their ability to

most efficiently separate speech event classes from one another.

Potamianos et al.’s hierarchical LDA approach was so called because the acoustic and

visual features progressed through a hierarchy of LDA transformations, performed

first on the individual modality feature vectors and then again on the concatenated

feature vector resulting from these features. Although the approach conducted here

does perform LDA feature reduction of the concatenated feature vectors, only the vi-

sual feature vector is also LDA-derived with the acoustic feature vectors left as the

A-MFCC or A-PLP features before concatenation. This approach was chosen because

the regular acoustic features offered good speech-event-separation performance and

any benefit of LDA feature reduction would likely be easily offset by the increased

time and processing required to take such an approach.

The process of transforming the concatenated feature vector into a smaller dimen-

sional space was conducted in an identical manner to the transformation of the video fea-

ture vectors in Chapter 5, although only 5 frame vectors were combined instead of

7 for the video features. The concatenated feature vectors used for this purpose did

not use the deltas or accelerations of the underlying datatypes, and to limit the pro-

cessing and memory required to calculate the LDA transformation matrix, only every

4th concatenated feature vector was considered. Once the LDA transformation matrix

was obtained, it was used to extract the top 24 LDA features for each concatenated fea-

ture vector, and deltas and accelerations were added to result in 72-dimensional feature

vectors for the FF-LDA datatype.
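A minimal sketch of this discriminative fusion stage is given below; it re-uses the stack_frames helper sketched in Section 5.6.1, follows the frame-stacking, subsampling and dimensionality figures described above, and leaves the speech-event labels as a placeholder.

```python
# Minimal sketch of the FF-LDA stage: stack 5 consecutive fused (audio+video,
# no deltas) vectors, estimate the LDA transform on every 4th stacked vector,
# and keep the top 24 LDA dimensions. stack_frames() is the helper sketched
# earlier; speech-event labels come from forced alignment.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_ff_lda(fused_static, event_labels, out_dim=24, subsample=4):
    stacked = stack_frames(fused_static, context=2)        # 5-frame window
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(stacked[::subsample], event_labels[::subsample])
    return lda, lda.transform(stacked)   # deltas/accelerations appended later
```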

This discriminative feature fusion datatype was then used to train the full range of

HMM and GMM models in the speech processing framework, in an identical manner

to the concatenative and individual features. An identical topology was chosen as for

concatenative feature fusion (11 states, 16 mixtures) to allow for easy comparisons to

be made between the two fusion techniques and the uni-modal modelling techniques.

6.4 Late integration

6.4.1 Introduction

One of the limitations of the early integration strategy is that, by combining both the

acoustic and visual modalities into a single feature vector, there is limited ability to

model the reliability of each modality. The ability to explicitly model the reliability

of either modality is very important for both speech and speaker recognition appli-

cations, for the simple reason that the discriminative ability of either modality can

vary widely in real world conditions with either modality behaving differently in the

presence of acoustic noise, visual degradation, tracking inaccuracies and individual

speaker characteristics. Fortunately, most sources of degradation typically affect one

modality to a greater extent than the other, allowing the unaffected modality to take up the slack when the other has become degraded. Some examples of such single-modality degradation might be invisible background noise, which would favour the

visual modality, or a continuously moving speaker causing tracking difficulties, which

would favour the acoustic modality.

Late integration systems combine the outputs of individual classifiers in the acoustic

and visual modalities, allowing the scores or decisions from each classifier to be eas-

ily weighted up or down based on the perceived reliability of either modality before

arriving at a final decision based on both modalities [89]. This approach therefore pro-

vides a simple mechanism for explicitly modelling the reliability of the acoustic and

visual modalities for audio-visual speech processing, and is an active area of research

in both speaker [104, 55, 20] and isolated speech recognition [49, 126, 69].

However, for continuous speech recognition late integration is considerably more dif-

ficult to implement because the sequence of classes must be agreed upon between

the modalities before the decisions of either modality can be combined. An extreme

example of this approach clearly leads back to isolated speech recognition, where the

boundaries of each word or smaller speech event are determined by either the acoustic

speech models or another external source before each modalities classifiers are com-

pared strictly within those boundaries. The other alternative is attempting to choose

the highest scored transcription by combing n-best transcriptions from both modal-

ities, but difficulties can arise if a particular transcription is not represented in both

modalities.

Due to the difficulties of a late integration approach with continuous speech recog-

nition, only speaker verification experiments will be presented to demonstrate the

late integration approach to audio-visual speech processing. An alternative approach

which allows modelling stream reliability within the speech models will be presented

by the middle-integration-based SHMM approach in Chapters 7 and 8.

Figure 6.2: Overview of the output score fusion approach used for this thesis.

6.4.2 Output score fusion for speaker verification

The late integration approach to speaker verification will be demonstrated in this the-

sis using weighted sum score fusion of the output scores of the individual classifiers

in the acoustic and visual modalities, including normalisation of the underlying score

distributions before combination. This approach is depicted in Figure 6.2.

While a speaker identification approach would require that many scores are normalised

and fused before they can be ranked, the verification approach chosen for speaker

recognition in the speech processing framework developed in Chapter 4 has the ad-

vantage that each test utterance can be represented by a single score for each modality.

The fused output score can easily be calculated from these two scores using

$$s_f = \alpha Z_a(s_a) + (1 - \alpha) Z_v(s_v) \qquad (6.1)$$

where $s_a$ and $s_v$ are the output scores of the audio and video classifiers, $Z_a$ and $Z_v$ are the score-normalisation functions, and $\alpha$ is the weighting parameter from which the individual stream weights are calculated.
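Expressed as a small function, (6.1) amounts to the following, with the normalisation functions passed in as callables:

```python
# Direct transcription of (6.1): weighted sum of the z-normalised audio and
# video scores, controlled by a single weighting parameter alpha.
def fuse_scores(s_a, s_v, alpha, z_a, z_v):
    """z_a and z_v are the score-normalisation functions Z_a and Z_v."""
    return alpha * z_a(s_a) + (1.0 - alpha) * z_v(s_v)
```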

As the scores from the HMM and GMM classifiers are given as log likelihood scores,

the choice of weighted sum fusion corresponds to exponentially weighted product

fusion of the likelihoods, but most classifier and fusion strategies operate in the log-

likelihood domain to avoid having to calculate exponentials and deal with multiplica-

tion of very small magnitude likelihoods accurately.

Output score fusion must wait for the individual classifiers in each modality to finish,

and the individual scores sa and sv are therefore gathered over the entire utterance.

If a decision is required to be reached on regions smaller than the entire utterance,

some form of segmentation must first be used to limit the original period on which

the classifiers are evaluated.

For the experiments presented later in this chapter, the individual scores before fusion

were identical to those used for speaker verification in Chapter 5, and are already nor-

malised against the background models, and so no further background-speech nor-

malisation is required after the output-score fusion.

6.4.3 Score-normalisation

Score normalisation is a technique used in multimodal biometric systems to com-

bine scores from multiple different classifiers [78] that may have very different score

distributions. By transforming the output of the classifiers into a common domain,

the scores can be fused through a simple weighted combination of scores, where the

weights can more accurately represent the true dependence of the final score on the in-

dividual classifiers. In this section the zero normalisation [78] method will be demon-

strated for the purpose of normalising audio and video classifiers scores before fusion

can occur.

Zero normalisation transforms scores from different classifiers that are assumed to be normally distributed into the standard normal distribution $\mathcal{N}(\mu = 0, \sigma^2 = 1)$ using the following function for each modality $i$:

$$Z_i(s_i) = \frac{s_i - \mu_i}{\sigma_i} \qquad (6.2)$$

where $s_i$ is an output score from that classifier, drawn from a distribution $S$ such that $S \sim \mathcal{N}(\mu_i, \sigma_i^2)$.

Figure 6.3: Histograms of speaker verification scores (a) before and (b) after normalisation. (Score against frequency for the A-PLP and V-LDA-MRDCT classifiers; (a) no normalisation, (b) zero normalisation.)

The estimated normalisation parameters $\mu_i$ and $\sigma_i$ are typically calculated on a left-out evaluation portion of the data. Then, during recognition, the original scores $s_i$ are replaced in the fusion equations with $Z_i(s_i)$, as shown in (6.1).
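A minimal sketch of this estimate-then-apply procedure is shown below; the evaluation-partition score arrays are assumed inputs.

```python
# Minimal sketch of zero normalisation (6.2): estimate mu_i and sigma_i on the
# evaluation partition's scores, then apply the same transform to all test
# scores (here, the clean-condition parameters, as in these experiments).
import numpy as np

def fit_z_norm(evaluation_scores):
    mu = float(np.mean(evaluation_scores))
    sigma = float(np.std(evaluation_scores))
    return lambda s: (s - mu) / sigma

# e.g. z_a = fit_z_norm(eval_audio_scores); z_v = fit_z_norm(eval_video_scores)
```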

For the implementation of zero-normalisation conducted in this thesis, each of the test-

ing partitions in the 12 testing configurations of the XM2VTS database defined in the

speech processing framework used the corresponding evaluation partition to estimate

the normalisation parameters. For these experiments, the acoustic (and video) normal-

isation parameters were calculated in clean conditions, and these clean normalisation

parameters were used for all testing observations, including the acoustically noisy

versions.

An example of the effect of zero-normalisation on the first testing configuration for

text-independent speaker verification is shown in Figure 6.3 as a histogram of the ob-

served likelihoods of getting each score from each modality over entire utterances.

From Figure 6.3(a) it can be seen that while the original acoustic and video classifiers have similarly shaped score distributions, the variance of the video classifier's scores is around twice that of its acoustic counterpart, resulting in it having a comparatively large impact on the final fusion score.

By normalising the means and variances of the individual classifiers' distributions to match $\mathcal{N}(0, 1)$, as shown in Figure 6.3(b), both modalities can be considered equal before each modality is weighted according to the chosen environment or application.

6.4.4 Modality weighting

The primary benefit of late integration over early integration is that the reliability

of each modality can be modelled simply through the use of multiplicative stream

weights applied before the individual scores are combined to form the fusion score.

While it is certainly possible for the stream weights to be represented using individual weights $\gamma_a$ and $\gamma_v$ for the audio and video streams respectively, the common convention is to have the weights sum to unity, and therefore both weights can be represented using a single weighting parameter $\alpha$:

$$\gamma_a + \gamma_v = 1 \qquad (6.3)$$

$$\gamma_a = \alpha \qquad (6.4)$$

$$\therefore \gamma_v = 1 - \alpha \qquad (6.5)$$

In an ideal fusion system the value of the weighting parameter α should be adaptive to

the prevailing conditions, such that the reliability of each modality can be estimated

on an utterance or even second-by-second basis and the reliance of the fusion on ei-

ther modality can easily vary based on this estimation. This is a relatively new area of

research in audio-visual speech processing, although a number of efforts have taken

place based upon SNR estimates [49], entropy measures [118], or the degree of acous-

tic voicing [126]. One of the more popular methods for adaptive fusion is based on some measure of the perceived quality of the individual classifiers [186, 55]. However, most audio-visual speech processing systems dealing with modality weighting parameters generally use a training or evaluation partition to determine the best weight for each modality on data similar to that under test, and have used such weights for all

of the testing utterances, with no direct consideration of the individual environmental conditions in each utterance.

Figure 6.4: Performance of weighted output score fusion for speaker verification as α is varied from 0 to 1. (Equal error rate, in %, against α at 0, 6, 12 and 18 dB SNR, together with the average over all noise levels.)

To determine the modality weights for the comparative experiments in this chapter,

output score fusion experiments were conducted using the A-PLP acoustic classifiers

alongside the V-LDA-MRDCT visual classifiers in performing text-dependent speaker

verification experiments over the entire speaker verification framework. The weight-

ing parameter α was varied from 0.0 to 1.0 in increments of 0.1, and the EER of each normalised output fusion combination was recorded for each weighting parameter

over all acoustic noise levels.

The results of these tuning experiments are shown in Figure 6.4 for text-dependent

speaker verification. The verification EER at all acoustic noise levels is shown, as well

as the average performance of each α over all noise levels. To allow the late integration

system to perform well unsupervised over all noise levels, the average performance

of all noise levels was also calculated for each α and is shown as a dashed line.

In order to compare the late-integration system presented here with the feature-fusion

systems, the late integration system had to be designed such that it could be run over

all noise values unsupervised. While the optimal approach would result from a sys-

tem that could adaptively estimate the noise level present in each utterance and choose

an appropriate α, such an adaptive system is non-trivial to implement [69]. Accord-

ingly, it was decided to simply choose the value of α which had the lowest average

EER over all noise levels, which can be seen in Figure 6.4 to be α = 0.2.
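A minimal sketch of this tuning loop is given below; compute_eer() stands in for whatever EER routine is in use, and the per-SNR score arrays are assumed inputs.

```python
# Minimal sketch of the alpha tuning described above: sweep alpha from 0.0 to
# 1.0 in steps of 0.1, fuse the normalised scores at each noise level, and keep
# the alpha with the lowest EER averaged over all noise levels. compute_eer()
# is a placeholder for the EER calculation.
import numpy as np

def tune_alpha(scores_by_snr, z_a, z_v, compute_eer):
    """scores_by_snr: {snr: (audio_scores, video_scores, labels)}."""
    alphas = np.arange(0.0, 1.01, 0.1)
    average_eers = []
    for alpha in alphas:
        eers = [compute_eer(alpha * z_a(s_a) + (1 - alpha) * z_v(s_v), labels)
                for s_a, s_v, labels in scores_by_snr.values()]
        average_eers.append(np.mean(eers))
    return float(alphas[int(np.argmin(average_eers))])
```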

The choice of the weighting parameter should reflect the environment that the final

speech processing experiments will be running in. In this case, because the visual and

clean-acoustic speaker verification EERs are already very low, more attention was paid

to the fusion performance in noisy acoustic conditions. By choosing the best average α

over all noise levels from the weighting experiments, an output-fusion system biased

towards noisy conditions resulted because the larger variances of the noisier 0 and 6 dB

SNR α-curves pulled the best α down in comparison to the relatively shallow α-curves

of the cleaner 12 and 18 dB SNR fusion experiments.

6.5 Speech recognition experiments

6.5.1 Results

The speech recognition experiments using the feature-fusion datatypes described ear-

lier were conducted according to the speech processing framework, and therefore in

an identical manner to the individual modality speech recognition experiments re-

ported in Chapter 5. Speaker-independent experiments were performed using the background HMMs, and speaker-dependent experiments using the HMMs adapted to each speaker being tested; the results are shown in Figures 6.5 and 6.6 respectively. The V-

LDA-MRDCT video speech recognition performance, which was unaffected by acous-

tic noise, is also shown for comparative purposes for both plots.

Figure 6.5: Speaker-independent feature-fusion speech recognition performance averaged over all 12 configurations of the XM2VTS database. (Word error rate against signal-to-noise ratio in dB for the A-PLP, V-LDA-MRDCT, concatenative fusion and discriminative fusion systems.)

As has already been discussed, output score fusion experiments were not performed

for speech recognition, but similar benefits can be seen in the SHMM models described

later in this thesis.

6.5.2 Discussion

From an examination of the results presented in Figures 6.5 and 6.6, it would appear that the discriminative feature-fusion features do have a benefit over the concatenative

feature-fusion for SNRs above 6dB, although neither of the feature fusion techniques

could provide an improvement over the acoustic-only speech recognition experiments

for the cleaner conditions.

Figure 6.6: Speaker-dependent feature-fusion speech recognition performance averaged over all 12 configurations of the XM2VTS database. (Word error rate against signal-to-noise ratio in dB for the A-PLP, V-LDA-MRDCT, concatenative fusion and discriminative fusion systems.)

Below 6dB SNR, the discriminative feature fusion speech recognition performance de-

grades compared to concatenative feature fusion for all of the experiments performed

above. This is likely to be due to the effect of extreme train/test mismatch by testing

the speech recognition performance on 0dB SNR acoustic data when the models were

trained in clean conditions. While such a large mismatch also has a similar effect on

the concatenative fusion performance, the mismatch only affects the acoustic features

in the concatenation. By comparison, the LDA process has the effect of combining the acoustic and visual features within every single feature of the discriminative feature-fusion vector, causing the train/test mismatch to have a detrimental effect on the

entire feature vector rather than just the acoustic portion. These results are similar to

those reported by Potamianos et al. [145], where they found that concatenative fusion

began to outperform discriminative for their experiments at around 0 dB SNR.

From these experiments, it can be seen that not one feature-fusion experiment pro-

vided better speech recognition performance than both individual modalities at all

acoustic noise levels. This effect, when a fusion system is outperformed by one of its

individual components, is referred to as catastrophic fusion and should be avoided as

this condition means that better performance could be obtained using an individual

system. For the speaker independent speech recognition experiments the discrimi-

native features were only catastrophic for 0dB SNR, and the concatenative were only

non-catastrophic (compared to audio) for 0 and 6 dB SNR. In the speaker dependent

experiments both the concatenative and discriminative feature fusion experiments

were only non-catastrophic at 6 dB SNR, and were improved upon at all other points

by either the acoustic or visual features. In fact, over all of the speech recognition

experiments reported above, the only point where better or equivalent performance

couldn’t be obtained using an individual modality was at the 6 dB SNR point. At all

other SNRs, the best performance could be obtained using either the acoustic or the

visual modalities alone.

6.6 Speaker verification experiments

6.6.1 Results

Speaker verification experiments were performed according to the speech processing

framework, using both the early and late integration techniques introduced in this

chapter. The results of these experiments were recorded using the EERs for each of

the integration techniques at each level of acoustic noise under test, and are shown in

Figure 6.7.

Figure 6.7: Simple integration strategies for text-dependent speaker verification over noisy acoustic conditions. (Equal error rate, in %, against signal-to-noise ratio in dB for the A-PLP, V-LDA-MRDCT, concatenative fusion, discriminative fusion and output score fusion systems.)

6.6.2 Discussion

From a comparison of the three fusion systems shown in Figure 6.7 (concatenative

feature-fusion, discriminative feature-fusion and output-score fusion), it can easily be

seen that the best performance can be obtained using the late integration approach

of output-score fusion. While the discriminative feature-fusion system does perform

slightly better in cleaner conditions, the very poor performance in noisy conditions

makes both feature-fusion systems unsuitable for speaker verification in many real

world conditions.

The late integration approach can produce better speaker verification performance,

particularly at high noise levels, due to the ability to model the reliability of the acous-

tic and visual classifiers using the stream weighting parameter α. Even better late

integration performance could be obtained by allowing the weighting parameter to

be varied based on the prevailing environmental conditions, as can be seen by look-

ing back at the best performing points of the α-curves in Figure 6.4. However, being

able to take advantage of these performance increases in an unsupervised manner re-

quires that the noise level can be estimated and an appropriate weighting parameter

determined automatically.

Within the early integration systems, the concatenative approach appears to provide

the best performance in noisy conditions, while the discriminative is slightly better

in clean conditions. Both early integration systems do improve upon the acoustic uni-modal performance, but still remain catastrophic when compared to the visual uni-

modal performance in noisy conditions.

6.7 Speech and speaker discussion

In this section, early integration methods of performing both speech and speaker

recognition were investigated. Both concatenative feature fusion and LDA-based dis-

criminative feature fusion were considered. In general, the discriminative feature

fusion technique was found to provide an improvement for speech recognition and

speaker recognition in clean conditions, but the discriminative process was found to

remove some robustness to acoustic noise that was present in the concatenative fea-

ture fusion systems.

However, for both the speech and speaker recognition tasks, neither was found to

provide a major improvement over either the acoustic or visual features across the whole

range of acoustic noise conditions presented here. Assuming that the noise level could

be estimated, in most cases similar or better performance could be obtained by choos-

ing one of the individual acoustic or visual systems over either of the feature fusion

systems.

Additionally, due to the need for the uni-modal classifiers to come to a decision before

the output score fusion can occur, such an approach first requires that the acoustic and

visual information being classified be segmented at the level at which fusion can occur.

This is not a major issue with speaker verification, as the speaker verification occurs

over the entire utterance, but difficulties can arise if events smaller than an utterance

are being considered for classification.

For this reason, late integration systems designed to recognise speech generally only

work well with isolated words, and such simple late integration systems are nearly

impossible with continuous speech due to the difficulty in isolating the words before

isolated-word classification can occur. However, a group of alternative approaches that do allow for the benefits of stream weighting within the continuous speech paradigm

will be demonstrated with multi-stream HMMs in the next chapter.

6.8 Chapter summary

In this chapter, two simple integration strategies were introduced that can easily be implemented using existing classifier techniques, by either fusing features before classification or fusing the output scores after classification. After reviewing existing approaches to simple integration in the literature, both concatenative and discriminative feature fusion were introduced as viable early integration strategies for modelling audio-visual speech. Similarly, for late integration, weighted-sum output score fusion was introduced to allow the reliability of each stream to be modelled for speaker verification applications.

In the second half of this chapter, these simple integration strategies were imple-

mented for speech and speaker recognition to serve as a comparative baseline for the

middle-integration based SHMM experiments conducted in the remaining portions of

this thesis.


Chapter 7

Synchronous HMMs

7.1 Introduction

In the previous chapter, methods of fusing acoustic and visual information both be-

fore and after classification were introduced. While the early integration approach

was found to work well in clean conditions, performance degraded considerably in

noisy acoustic conditions due to the inability to model the reliability of the acoustic

and visual speech features independently. The late integration approach introduced a method of combining separate acoustic and visual classifier scores, allowing weights to be applied before combination. This approach allowed for non-catastrophic fusion at all noise levels provided that appropriate weights were applied, but no decision could be made until both classifiers had completed, limiting the ability of late integration to easily perform continuous speech recognition.

This chapter will introduce the concept of middle integration methods, which combine the close time coupling of the feature-fusion approach with the ability to model stream reliability inherent in the late integration approach. Particular focus will be given to the

SHMM as it can be trained easily using existing techniques derived from the uni-

modal HMM training process outlined in Chapter 3, and is known to work well for


speech and speaker recognition applications.

In order to improve understanding of the SHMM model, this chapter will present novel research into the effect that weighting of the acoustic and visual streams, both in training and testing, has on the final speech recognition performance. While researchers have studied the stream weights of MSHMMs during decoding, the difference between the training and testing processes for SHMMs under differing stream weights has not yet been considered in the literature.

Additionally, this chapter will introduce the concept of normalisation, usually used in

a late integration design, within the SHMM model to normalise the differing acous-

tic and visual models within the SHMM states on a frame-by-frame basis. Both full

mean-and-variance normalisation and variance-only normalisation will be investigated, with both showing similar performance in flattening the performance curve as

the stream weights are varied, allowing for more latitude in choosing appropriate

stream weights.

7.2 Multi-stream HMMs

Multi-stream HMMs are a group of temporally-coupled modelling techniques de-

signed to extend the effectiveness of the uni-modal HMM structure for speech pro-

cessing into the multi-modal domain. A number of variations exist within the broad

label of multi-stream HMMs, with the major difference between each model hinging

upon where the acoustic and visual information is tied, or coupled, together. All these

techniques fall under the even broader umbrella of dynamic Bayesian networks [124],

and examples of the most popular multi-stream HMMs in use for AVSP are shown in

Figure 7.1.

The simplest multi-stream HMM is the SHMM, shown in Figure 7.1(b), which cou-

ples the acoustic and visual observations at every frame. Such an approach results in

an almost identical HMM structure to a uni-modal HMM, but with two observation-


Figure 7.1: Various multi-stream HMM modelling techniques used for AVSP in comparison to the uni-modal (acoustic) HMM (a): (b) synchronous HMM, (c) state-asynchronous HMM, (d) coupled HMM, and (e) product/factorial HMM. Acoustic emission densities are shown in blue and visual in red.


emission GMMs in each state instead of one. Due to the simplicity of the SHMM and

existing implementations for separating differing features in acoustic speech recogni-

tion (such as static and dynamic features), the earliest attempts at middle integration

for AVSP were undertaken with this modelling technique by Potamianos et al. [141]

for speech recognition and Wark et al. [187] for speaker recognition.

While the SHMM did, and still continues to, work well for AVSP, researchers have continued to investigate alternative multi-stream HMMs in which the acoustic and visual information is not coupled as tightly as in the SHMM. Such an approach is motivated by the asynchronous nature of audio-visual speech, as it has been known

for some time that the visual speech activity tends to precede the acoustic signal it

generates by up to 120 ms [16, 95].

To handle the asynchrony between the audio and visual speech information whilst still maintaining alignments at the model boundaries, a state-asynchronous HMM [12] can be generalised as two uni-modal HMMs tied together at the boundaries of the speech

event being modelled. An example of such an approach is shown in Figure 7.1(c).

An alternative approach is taken in the coupled HMM [122] shown in Figure 7.1(d)

where the acoustic and visual states can transition within the asynchronous region,

but remain tied at the model boundaries.

While both the asynchronous and coupled HMMs can be implemented directly, and have been for AVSP by multiple researchers [12, 122], a common simplification is to implement a generalised form of such networks as a product or factorial HMM [145]. While such an approach does require more states ($S^2$ compared to $2S$) than the dynamic Bayesian network approach, it does allow for implementation using a synchronous HMM with additional states linked, as shown in Figure 7.1(e), such that multiple states share the same acoustic or visual state models.

The main difference between the asynchronous and coupled HMMs when imple-

mented as a product HMM arises from the method of calculating the additional state

transitions and probabilities. An additional simplification that can be applied is to


limit the permitted asynchrony in the product HMM, which in the extreme limit of

no asynchrony would result in only the diagonal of the product HMM remaining as a

SHMM.

Both the asynchronous and coupled HMMs, whether implemented directly or using

product HMMs, are much more complicated to train and test than the comparatively

simple SHMM, and thus have mostly been limited to small vocabulary recognition

tasks [49, 124, 145] with only limited large vocabulary implementations [126]. The

only middle integration method that has successfully been demonstrated in real-world

conditions for large vocabulary speech recognition appears to be Potamianos et al.’s

implementation of the SHMM [145], although a large vocabulary product-HMM ap-

proach has been demonstrated through lattice re-scoring [126].

7.3 Synchronous HMMs

7.3.1 Introduction

A SHMM can be viewed as a regular single-stream continuous HMM, but with two observation-emission Gaussian mixture models (GMMs) for each state (one for audio and one for video), as shown in Figure 7.1(b). SHMMs have previously been used in audio-only speech recognition tasks to model differing types of audio features separately, such as static and time-derivative-based features [194]. For AVSP,

audio-visual SHMMs use a different stream for each modality, and this approach has

been used extensively for both speech and speaker recognition research [83, 49, 126,

188, 145].

SHMMs are at an advantage over feature-fusion HMMs primarily because of their

ability to weight each modality on an individual basis. Feature-fusion HMMs are

trained on state models estimated over the entire concatenated or discriminative audio-

visual vector. Because both modalities’ features are combined in the one model, it


is not possible within the feature fusion design to consider a situation where one of

the modalities has more weight than the other. By allowing the two modalities to be

treated independently, the SHMM model is more flexible and can generally provide

greater AVSP performance [145].

Given the audio and visual observation vectors $\mathbf{o}_{a,t}$ and $\mathbf{o}_{v,t}$, the observation-emission score of SHMM state $u$ is given as

$P(\mathbf{o}_{a,t}, \mathbf{o}_{v,t} \mid u) = P(\mathbf{o}_{a,t} \mid u)^{\alpha}\, P(\mathbf{o}_{v,t} \mid u)^{1-\alpha}$ (7.1)

where α is a single stream weighting parameter, $0 \le \alpha \le 1$, defined identically to that used in Chapter 6 for output score fusion.
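As an illustration only, the following Python sketch shows how the weighted state-emission score of (7.1) could be evaluated in the log domain for a single frame; the gmm_audio and gmm_video objects and the use of scikit-learn GMMs are assumptions for the example rather than the implementation used in this thesis.

import numpy as np
from sklearn.mixture import GaussianMixture

def shmm_log_emission(gmm_audio: GaussianMixture,
                      gmm_video: GaussianMixture,
                      o_a: np.ndarray, o_v: np.ndarray,
                      alpha: float) -> float:
    """log P(o_a, o_v | u) = alpha*log P(o_a | u) + (1 - alpha)*log P(o_v | u), as in (7.1)."""
    log_p_a = gmm_audio.score_samples(o_a.reshape(1, -1))[0]   # audio stream log-likelihood
    log_p_v = gmm_video.score_samples(o_v.reshape(1, -1))[0]   # video stream log-likelihood
    return alpha * log_p_a + (1.0 - alpha) * log_p_v

In practice this combination is performed inside the decoder for every frame and every state of the SHMM.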

The full SHMM parameter set can then be defined as $\hat{\lambda}_{av} = [\lambda_{av}, \alpha]$, where $\lambda_{av} = [A_{av}, B_a, B_v]$. Within the underlying HMM parameters $\lambda_{av}$, the joint state-transition probabilities are contained in $A_{av}$, while $B_a$ and $B_v$ represent the observation-emission probability parameters of the audio and video modalities respectively [145]. Training of the SHMM is the process of estimating these parameters. The parameters in $\lambda_{av}$ can be estimated automatically using Baum-Welch re-estimation (see Chapter 3), and the stream

weight parameter α is typically estimated by maximising speech performance on an

evaluation session, although more flexible methods based on the concept of stream

reliability have been developed [145].

7.3.2 SHMM joint-training

In the existing literature [126], the estimation of the underlying HMM parameters $\lambda_{av}$ has been performed in one of two manners: either the single-stream parameters are estimated independently and combined, or the entire set of parameters is jointly estimated using both modalities. Because the combination method makes the incorrect assumption that the two HMMs were state-synchronous before combination, better performance has been shown to be obtained with the joint-training method [126], which is used to train the SHMM models evaluated in this chapter.


The Baum-Welch re-estimation algorithm is the iterative process used to calculate the

HMM parameters from a training set of representative speech events. The algorithm

was covered in detail in Chapter 3, but can be briefly outlined as follows:

1. Use the HMM parameters (emission and state transition likelihoods) and the

training data to estimate the state-occupation probability Lj (t) for all states j

and times t.

2. For each stream, use the state-occupation probability and the training data to

re-estimate new HMM parameters.

3. Repeat at Step 1 if the HMM parameters have not converged.

As the Baum-Welch algorithm requires an initial set of HMM parameters to form the first estimate of Lj(t), the parameters are generally initialised by segmenting the training observations equally amongst the state models. From these segmented training

observations the initial set of observation-emission parameters are determined for

each state. From this point, the Baum-Welch algorithm can take over to refine the

state-alignments and HMM parameters until they have converged upon a solution.

In an audio-visual SHMM, it can be seen that the choice of the stream weighting parameter α only has a direct effect on the estimation of the state-occupation probabilities in Step 1, as this probability is directly based on the observation-emission likelihoods calculated using (7.1). As the observation-emission parameters of each stream are re-estimated independently in Step 2, they are not directly affected by the stream weighting parameter, to the extent that they will still be re-estimated even if the pertinent stream is weighted to nothing.
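To make this point concrete, the following sketch (an illustration under the assumption that the per-stream log emission likelihoods have already been computed; it is not the HMM Toolkit implementation) shows the state-occupation estimation of Step 1, where αtrain enters only through the fused emission scores.

import numpy as np
from scipy.special import logsumexp

def weighted_state_occupancy(log_b_a, log_b_v, log_A, log_pi, alpha_train):
    """Estimate the state-occupation probabilities L_j(t) for one utterance.

    log_b_a, log_b_v : (T, N) per-frame, per-state log emission scores of the
                       audio and video streams.
    log_A            : (N, N) log state-transition matrix.
    log_pi           : (N,) log initial-state probabilities.
    alpha_train enters only here, via the fused emission score of (7.1).
    """
    log_b = alpha_train * log_b_a + (1.0 - alpha_train) * log_b_v
    T, N = log_b.shape

    # Forward pass.
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) + log_b[t]

    # Backward pass.
    log_beta = np.zeros((T, N))
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)

    # Normalised state-occupation probabilities (the E-step quantities).
    log_gamma = log_alpha + log_beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

The per-stream parameter updates of Step 2 would then weight the audio and video training frames by these same occupancies, which is why the value of αtrain can only influence the final models indirectly, through the alignment.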

The jointly-trained SHMMs used for the experiments in this chapter were trained

according to the speech processing framework, but due to limitations of the HMM

Toolkit [194], only speaker-independent background word SHMMs could be trained.

For this reason, only speaker-independent speech recognition performance of SHMMs

will be evaluated in this chapter.


The SHMM topology was chosen to match the uni-modal HMM topologies used in Chapter 5, with the number of states taken from the acoustic HMM, and the number of mixtures for each stream taken from both. The resulting SHMM topology had 11 states, with an 8-mixture acoustic GMM and a 16-mixture visual GMM representing each state of the speaker-independent models. Training was performed with the HMM Toolkit [194], which already had built-in support for joint training of SHMMs.

7.4 Weighting of synchronous HMMs†

7.4.1 Introduction

The primary benefit in choosing a SHMM approach for AVSP over early integration is

the ability to weight the acoustic and visual streams based on the perceived reliability

of each individual modality. Accordingly, knowing what effect the stream weights have

on the final performance of the SHMM model is an important precursor to training

SHMMs to model audio-visual speech. While a number of researchers have studied

the effect of SHMM stream weights during the decoding of audio-visual speech [61,

69], there has been no research in the literature on what effect the stream weights have

during the training process.

In this section, a study will be performed to determine what effect, if any, varying the stream weights during the SHMM training process has on the final speech recognition performance. The outcome of these experiments will also be compared and

contrasted with varying the stream weights during speech decoding, where they are

already known to have a significant impact on performance [61].


Figure 7.2: Speech recognition performance (word error rate versus αtest at 0, 6, 12 and 18 dB SNR) using SHMMs as αtest is varied. Each point represents a different αtrain and the line is the average over all αtrain values for each αtest.

7.4.2 Results

To investigate the effect of varying the training and testing stream weights indepen-

dently, the single stream weighting parameter α was sub-divided into two separate

parameters, αtrain and αtest, representing the stream weights used during training and

testing of the SHMM respectively. Eleven training weights, αtrain = 0.0, 0.1, ..., 1.0, and eleven testing weights, αtest = 0.0, 0.1, ..., 1.0, were combined to arrive at 121 individual

speech experiments. These experiments were performed over all 4 testing noise levels,

resulting in a total of 484 tests. To limit the processing time, these weighting experi-

ments were only performed on the first configuration of the XM2VTS database under

the speech processing framework.

The resulting WER obtained for each of these experiments is plotted against αtest in Figure 7.2. Each of the points within each αtest is a differing αtrain and the line shows


Figure 7.3: Speech recognition performance (word error rate) using SHMMs as αtrain is varied. For each noise level, αtest is fixed at its best average value from Figure 7.2 (0.7 at 0 dB, 0.8 at 6 dB, and 0.9 at 12 dB and 18 dB SNR).

the average WER over all αtrain for a particular αtest. To illustrate the relatively static

performance as αtrain is varied, a similar plot of the WERs as αtrain is varied at the best

performing αtest for each noise level is shown in Figure 7.3.

7.4.3 Discussion

From examining both these figures, it can be seen that the variance in WER across the entire range of αtrain is of little-to-no significance to the final speech recognition per-

formance. The choice of αtest is clearly the major factor in the speech performance,

with the minimum WER achieved with αtest around 0.8− 0.9, at least in the cleaner

conditions. However, there appears to be no significant trend visible in the WER as

αtrain varies from 0.0 to 1.0.


As discussed earlier, the training of a HMM is basically an iterative process of contin-

uously re-estimating state boundaries, and then re-estimating the HMM parameters

based on those boundaries. The value of αtrain has no direct effect on the re-estimation

of the HMM parameters, so the only effect of αtrain comes about when using the es-

timated HMM parameters to arrive at a new set of estimated state boundaries. For

example, if αtrain = 0.0, then only the video parameters will determine the state bound-

aries during training. Similarly αtrain = 1.0 will only use the audio, and values between

those two extremes will use a combination of both modalities for the task.

As the speech transcription is known, training of a HMM is a much more constrained

task than the decoding of unknown speech. The 18 dB SNR results presented in the previous section show that the decoding WER varies from below 5% for audio-only to above 35% for video-only when the testing weight parameter, αtest, is set at

the extremes of 1.0 and 0.0 respectively. That changing the training weight parameter,

αtrain, has no similar effect on the final speech recognition performance suggests that,

at least in this case, the video or audio models perform equally well in estimating the

state boundaries during training, and there appears to be no real benefit to any fusion

of the two.

7.5 Normalisation of synchronous HMMs†

7.5.1 Introduction

Score normalisation is a technique used in multimodal biometric systems to com-

bine scores from multiple different classifiers [78] that may have very different score

distributions. By transforming the output of the classifiers into a common domain,

the scores can be fused through a simple weighted combination of scores, where the

weights can more accurately represent the true dependence of the final score on the

individual classifiers. Normalisation was used previously in the output score fusion

experiments in Chapter 6.


Two approaches were chosen to normalise the acoustic and visual streams within the

SHMM structure: full and variance-only normalisation. The full normalisation ap-

proach allows both the means and variances of the two modalities to be matched, but

requires access to the internals of the Viterbi decoder. The variance-only normalisation

technique was developed to allow for a similar effect through a simple modification

of the stream weights, allowing implementation in a wider range of circumstances.

Full mean and variance normalisation

For the SHMM normalisation technique, the video-score distribution was adapted to that of the audio score, rather than performing zero normalisation on both

distributions. This configuration was chosen because zero-normalisation would cause

the state-emission log-likelihood-scores to be much smaller than the state-transition

log-likelihoods, causing the final speech recognition to be overwhelmed by the lat-

ter. By using the audio log-likelihood-score distribution as a template, the final state-

emission scores should be in a similar range to that of the un-normalised SHMM.

To perform the video normalisation, the outputs of the video-state models were first transformed to the standard normal distribution and then to the audio distribution. The

final log-likelihood score $s_f$ from the combined SHMM state-model is therefore given as

$s_f = \alpha s_a + (1-\alpha)\underbrace{\Bigg(\underbrace{\dfrac{s_v - \mu_v}{\sigma_v}}_{\rightarrow N(0,1)} \times \sigma_a + \mu_a\Bigg)}_{\rightarrow N(\mu_a,\,\sigma_a^2)}$ (7.2)
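A minimal sketch of this normalised score combination is given below; the function would sit inside the state-emission computation of the Viterbi decoder, and all names are illustrative rather than part of the thesis implementation.

def normalised_shmm_score(s_a: float, s_v: float, alpha: float,
                          mu_a: float, sigma_a: float,
                          mu_v: float, sigma_v: float) -> float:
    """Full mean-and-variance normalised fusion of (7.2)."""
    # Map the video score onto the audio score distribution: N(0,1) -> N(mu_a, sigma_a^2).
    s_v_norm = (s_v - mu_v) / sigma_v * sigma_a + mu_a
    return alpha * s_a + (1.0 - alpha) * s_v_norm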

One of the problems with this form of normalisation, however, is that implementing

the full mean and variance normalisation of (7.2) requires access to the interior of the

Viterbi decoder algorithm, which can be difficult with publicly available HMM tools,

such as the HMM Toolkit [194].


Variance-only normalisation

To overcome this difficulty, another option is to only normalise the variances of the two

score log-likelihood distributions, as this can be implemented solely through adjust-

ment of the stream weighting parameter. Provided that the fused score log-likelihood

distribution is not too dissimilar after normalisation, to prevent problems with over or

under-whelming the state-transition scores, the means of the two score distributions

do not necessarily need to be equal. This is because speech recognition is a compar-

ative task, and a change in mean on a whole-stream basis will have no effect on the

paths through the speech recognition lattice, as each path will be affected similarly.

Therefore, if mean normalisation is not required, normalisation can be more easily performed by considering the final modality weights (γa,final, γv,final) to be a combination of the intended test weights and the calculated normalisation weights:

$\gamma_{a,final} = \gamma_{a,test} \times \gamma_{a,norm}$ (7.3)

$\gamma_{v,final} = \gamma_{v,test} \times \gamma_{v,norm}$ (7.4)

where the testing and normalisation weights can further be expressed in terms of the single weighting parameters αtest and αnorm respectively:

$\gamma_{a,final} = \alpha_{test} \times \alpha_{norm}$ (7.5)

$\gamma_{v,final} = (1-\alpha_{test})(1-\alpha_{norm})$ (7.6)

However, to ensure the state-emission scores remain in the same general relationship with the state-transition scores, the stream weights should sum to 1. Using (7.5) and (7.6), the final weight parameter is:


$\alpha_{final} = \dfrac{\gamma_{a,final}}{\gamma_{a,final} + \gamma_{v,final}}$ (7.7)

$\alpha_{final} = \dfrac{\alpha_{test}\,\alpha_{norm}}{\alpha_{test}\,\alpha_{norm} + (1-\alpha_{test})(1-\alpha_{norm})}$ (7.8)

Figure 7.4: Distribution of per-frame scores for the individual A-PLP audio and V-LDA-MRDCT video state-models within the SHMM under (a) no normalisation, (b) full normalisation and (c) variance-only normalisation.

To calculate the normalisation weighting parameter αnorm, use the following property

of normal distributions,

$k\,N(\mu, \sigma^2) \sim N\big(k\mu,\,(k\sigma)^2\big)$ (7.9)

and attempt to equalise the standard deviations of the two weighted score distribu-

tions:

$\alpha_{norm}\,\sigma_a = (1-\alpha_{norm})\,\sigma_v$ (7.10)

$\alpha_{norm} = \dfrac{\sigma_v}{\sigma_a + \sigma_v}$ (7.11)
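The following short sketch computes the variance-only normalisation weight of (7.11) and the mapping from an intended αtest to αfinal via (7.8). The numeric values used in the example are the evaluation-set estimates later reported in Table 7.1; the code itself is an illustration rather than the thesis implementation.

def alpha_norm(sigma_a: float, sigma_v: float) -> float:
    """Variance-only normalisation weight of (7.11)."""
    return sigma_v / (sigma_a + sigma_v)

def alpha_final(alpha_test: float, a_norm: float) -> float:
    """Map an intended test weight to the final decoder weight via (7.8)."""
    num = alpha_test * a_norm
    return num / (num + (1.0 - alpha_test) * (1.0 - a_norm))

a_norm = alpha_norm(sigma_a=9.52, sigma_v=28.71)        # approximately 0.751
mapping = {round(a, 1): round(alpha_final(a, a_norm), 2)
           for a in [x / 10 for x in range(11)]}         # reproduces the Table 7.2 mapping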

7.5.2 Determining normalisation parameters

Before the score-normalisation could occur, the normalisation parameters of both dis-

tributions were determined by scoring the known transcriptions on the evaluation


Datatype        µ_i      σ_i
A-PLP          -59.81    9.52
V-LDA-MRDCT     -6.23   28.71

Table 7.1: Normalisation parameters determined from the per-frame evaluation score distributions.

session with stream weight parameter, α, set such that only the modality of interest

was being tested (i.e. α = 0 for the video frame-scores and α = 1 for the audio).

A full speech recognition task was also attempted (rather than forced alignment with

a transcription) to calculate the distribution parameters, but no major difference was

noted in the final parameters. This was most likely because the difference between the

two modalities’ score distributions was much larger than any difference between the

score distributions of different state models within a particular modality.

The scores of the best path were then recorded on a frame-by-frame basis to determine

the score-distribution of each modality, shown in Figure 7.4(a). The normalisation

parameters, shown in Table 7.1, were then estimated from the score distributions for

each modality.

The effect of full mean-and-variance normalisation using these parameters on the

score-distribution is shown in Figure 7.4(b). It can be seen that the audio score dis-

tributions remain untouched, and the video scores have been transformed into the

same domain as the audio. Because this normalisation occurs within the Viterbi de-

coding process, an in-house HMM decoder was used to implement this functionality,

as it was not possible within the HMM Toolkit [194].

To perform variance-only normalisation, the normalisation parameters shown in Table 7.1 and (7.11) were used to arrive at a normalisation weighting parameter of αnorm = 0.751.

Using this normalisation weighting parameter and the relationship shown in (7.8), any intended αtest can be mapped to the equivalent αfinal, which includes the effects of variance normalisation, as shown in Table 7.2.


αtest     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
αfinal    0.00  0.25  0.43  0.56  0.67  0.75  0.82  0.88  0.92  0.96  1.00

Table 7.2: Final weighting parameter αfinal calculated from the intended weighting parameter αtest using the normalisation parameter αnorm = 0.751.

The outcome of applying these variance-normalisation parameters to the unweighted score distributions is shown in Figure 7.4(c). It can be seen that the variance of the two score distributions has been equalised, while the means remain very separate, although both have changed from the non-normalised score distributions.

7.5.3 Results

To investigate the effect of these two normalisation techniques on speech recognition

performance, a series of tests were conducted at varying levels of αtest for both meth-

ods of score normalisation. These tests used the models trained in the previous section with the training weight parameter set to αtrain = 0.8. As discussed previously, the choice of the training weight parameter was fairly arbitrary, but this value was chosen as it had the lowest average WER over all values of αtest and

all noise levels by a minor margin. The results of the two normalisation methods are

shown in comparison to the un-normalised speech recognition performance in Fig-

ure 7.5.

7.5.4 Discussion

From examining Figure 7.5, it can be seen that both normalisation methods perform very similarly around the best-performing section of the curve in cleaner conditions. Perform-

ing full mean and variance normalisation of the video state models into the same

domain as the audio state models does appear to give a noticeable improvement in

the video-only (αtest = 0.0) SHMM performance. However, the video-only perfor-

mance of the normalised SHMMs still does not match the uni-modal video HMMs’


Figure 7.5: Speech recognition performance (word error rate versus αtest) with no normalisation, variance-only normalisation and full mean-and-variance normalisation, at (a) 0 dB, (b) 6 dB, (c) 12 dB and (d) 18 dB SNR.


speaker independent speech recognition WER of 27.9%. Accordingly, it is unlikely that

the SHMMs would be used in this configuration, so the effect of either normalisation

method at αtest = 0.0 is of only minor importance.

From the stream-weighting WER curves in Figure 7.5, and the variance-normalisation mappings between the intended and final weighting parameters shown in Table 7.2, it can be seen that normalisation essentially moves the centre of the αtest range closer to the best-performing non-normalised αtest, while also producing a flatter WER curve around this point. These results suggest that the best-performing weighting parameter of αtest ≈ 0.8-0.9 from the earlier weighting experiments primarily served to normalise the two modalities rather than to indicate their relative contribution to the final SHMM performance. The best-performing αtest in the normalised system is

much closer to 0.5, indicating that both modalities are contributing almost equally to

the final performance.

Each of the 12 configurations of the XM2VTS database established in the speech pro-

cessing framework has a different combination of training, testing and evaluation ses-

sions. For this reason, each configuration tested in these experiments calculates and uses its own normalisation parameters, with the full normalisation based upon the evaluation session of that particular XM2VTS configuration.

7.6 Speech recognition experiments†

7.6.1 Choosing the stream weight parameters

Before the SHMM design can be used to recognise speech over all configurations of

the XM2VTS data defined in the speech processing framework, the training and test-

ing stream weights must be chosen such that the SHMM works best over all acoustic

noise conditions. The stream weighting experiments in Section 7.4 showed that the

training stream weight parameter αtrain has little impact on the final speech recogni-


Figure 7.6: Speaker independent speech recognition (word error rate versus αtest at 0, 6, 12 and 18 dB SNR, and averaged over all noise levels) using full-normalised word-model SHMMs as αtest is varied on the first configuration of the XM2VTS database.

tion performance, with the major impact on final performance arising from the testing

stream weighting parameter αtest.

Based on the earlier SHMM normalisation experiments, the final SHMM used for the speech recognition experiments in this chapter included full normalisation of the visual means and variances to match the acoustic. The choice of a single, post-normalisation αtest to perform speech recognition over the full range of acoustic noise levels was made by looking at the speech recognition performance of the normalised SHMM over the first configuration of the XM2VTS database under the speech processing framework, shown in Figure 7.6.

Of course, while the choice of αtrain has been demonstrated earlier to be of little import,

some choice did have to be made, and in these cases, αtrain = 0.8 was chosen as it had

the lowest average WER over all values of αtest and all noise levels by a minor margin.


Figure 7.7: Speaker-independent speech recognition performance (word error rate versus signal-to-noise ratio) using jointly-trained SHMMs over all 12 configurations of the XM2VTS database, compared with the A-PLP acoustic HMM, the V-LDA-MRDCT visual HMM and discriminative feature fusion.

From the weighting curves shown in Figure 7.6, it can be seen that a testing stream

weighting parameter of αtest = 0.5 worked best when averaged over all noise levels.

Accordingly, this value was chosen for the testing stream weighting parameter of the

final SHMM speech processing experiments conducted in this chapter.

7.6.2 Results

The SHMM speech recognition experiments were conducted using the speech process-

ing framework developed in Chapter 4. However, due to the inability of the HMM

Toolkit [194] to perform adaptation on multi-stream HMMs, only the background

speech models could be trained, and therefore only the speaker independent speech

recognition experiments could be performed with the jointly-trained SHMMs. An alternative method of training speaker-dependent SHMMs using FHMM-adaptation will be detailed in Chapter 8.


The results of the speaker independent speech recognition experiments, over all con-

figurations of the XM2VTS database, are shown in Figure 7.7. The SHMM speech

recognition WERs are shown against the equivalent discriminative feature-fusion sys-

tem and uni-modal acoustic HMMs. The uni-modal visual HMM results are also

shown as an additional baseline.

7.7 Discussion

In looking at the speech recognition results in Figure 7.7, it can be seen that the ability

to normalise and weight individual modalities provided by the SHMM design gives

a significant benefit over feature-fusion-based designs. While in this case a single

set of stream weights was found to work well over the entire range of acoustic noise

conditions for speaker independent speech recognition, the SHMM design can also

allow for the weights to be changed based on the prevailing recognition conditions.

This would provide for even better speech recognition performance than has been

presented here, but a method of estimating the environmental conditions and deriving

an appropriate weighting parameter would first have to be developed.

In comparison to the uni-modal acoustic HMMs, the SHMM design provides better

or similar performance over the entire range of acoustic noise conditions when nor-

malised and weighted at 50% audio and 50% video (αtest = 0.5). Even in the cleaner

acoustic conditions, the introduction of visual speech information did not noticeably

decrease the SHMM performance relative to the uni-modal acoustic HMM.

One of the more important considerations for an audio-visual speech processing system is that it remain non-catastrophic in poor acoustic conditions, as robustness to acoustic noise is one of the main selling points of using audio-visual speech information over audio alone. Thankfully, the SHMM design does meet this requirement at least down to 0 dB SNR, allowing such an approach to be used confidently in

quite degraded conditions without causing catastrophic fusion. Of course, the ability


to dynamically change SHMM stream weights at run-time would allow the SHMM to

be run in even worse acoustic conditions through adjusting the stream weights closer

to αtest = 0.0, or video-only performance as the audio became unusable.

7.8 Chapter summary

In this chapter, the middle integration approach was introduced, combining the below-utterance-level fusion of the early integration approach with the ability to weight the acoustic and video modalities separately that is inherent in the late integration approach. A

number of middle-integration multi-stream HMMs have been used in the AVSP liter-

ature, of which the most popular choices were reviewed early in this chapter, compar-

ing and contrasting the strictness and placement of the audio-visual couplings within the

various multi-stream modelling techniques.

The simplest multi-stream HMM, and the subject of this chapter, was the SHMM, which

coupled the acoustic and visual observations together within every state of the HMM.

By keeping the design simple, the SHMM design is much easier to train and test with

limited data than other more complicated multi-stream HMMs in use in the literature.

Because the main benefit of the SHMM design over a simple feature-fusion HMM

is the ability to treat each modality separately, this chapter presented a number of

novel experiments in the weighting and normalisation of the audio and visual streams

within the SHMM design.

By examining the effect of varying the stream weights during training and testing of

the SHMM on the final speech recognition performance, it was determined that the

choice of stream weights used during the training of the SHMM had no real effect

on the final performance of the SHMM, with the main factor in the final performance

being the choice of stream weights during testing.

In order to improve the ability to weight the acoustic and visual modalities accurately


in testing conditions, the concept of classifier normalisation used for output score fusion in the previous chapter was introduced within the HMM decoder to allow normalisation within the SHMM structure. As access to the internals of the decoder can be difficult with some HMM toolkits, an alternative variance-only form of normalisation was designed that could be implemented solely through adjustment of the stream weighting parameters, with similar performance to the full normalisation method. SHMM normalisation was found to flatten the speech recognition performance of the SHMM as the stream weights are varied, and moved the best-performing stream weighting parameters closer to equal audio and video weights, in comparison to the 80-90% audio weighting that performed best for un-normalised SHMMs.


Chapter 8

Fused HMM-Adaptation of

Synchronous HMMs

8.1 Introduction

This chapter will introduce a novel method of adapting a SHMM from an already trained uni-modal acoustic HMM, extended from Pan et al.'s [130] proposed Fused HMM (FHMM) classifier structure. By adapting the visual state classifiers directly from training segmentations performed by a well-performing acoustic HMM, the FHMM-adaptation

process can produce a SHMM that outperforms the jointly-trained SHMM at all levels

of acoustic noise, with no increase in model complexity. This method can also be used

to create speaker-adapted visual states for use alongside the speaker-adapted acoustic HMM, allowing for speaker-dependent speech models for use in speaker-dependent speech recognition and speaker verification. Such speaker-dependent SHMMs

were not possible within the HMM Toolkit [194] for jointly-trained SHMMs, but the

FHMM-adaptation method will allow for these speaker dependent speech processing

tasks to be demonstrated using FHMM-adapted SHMMs in this chapter.

This chapter will begin with an introduction to Pan et al.’s [130] original theory and


implementation of the FHMM structure, which consisted of a continuous classifier

for the acoustic modality combined with a static vector-quantisation classifier for the

visual modality within each state of a SHMM-like structure. By extending Pan et al.'s speech processing model, replacing the discrete vector-quantisation classifier with a continuous GMM classifier, it will be demonstrated that this continuous FHMM structure can be considered identical to a SHMM, and therefore the FHMM training method

can be considered a novel approach to training a SHMM through adaptation based on

the state alignment of a uni-modal acoustic HMM.

Finally, the FHMM-adapted SHMM will be evaluated on the applicable speech processing tasks under the speech processing framework developed in Chapter 4, to demonstrate the improved speech modelling ability of the FHMM-adaptation method

over joint training of SHMMs.

8.2 Discrete fused HMMs

8.2.1 Introduction

The original design of Pan et al.’s FHMM [130] was motivated by an attempt to max-

imise the mutual information between the two tightly coupled acoustic and visual

streams for audio-visual speech processing tasks while keeping the design of the resulting multi-stream model relatively simple when compared to some of the more compli-

cated multi-stream HMM techniques such as coupled or asynchronous HMMs.

In their work on the FHMM structure, which is outlined below, Pan et al. showed

that the maximum mutual information was obtained when the observations of one

modality are combined with the states of the other, rather than in a design that links

the hidden states of separate HMMs [130].

This resulting FHMM structure can either be acoustically or visually biased based

upon which modality controls the state transitions during training, and in Pan et


al.’s implementation both biased versions were considered in output decision fusion

for speaker verification. As this implementation represented the subordinate modality using discrete vector quantisation of the subordinate observations, this FHMM

structure will be referred to as discrete FHMMs for the remainder of this thesis.

This section will outline Pan et al.'s theory behind calculating the optimal multi-stream HMM structure based on maximising the mutual information between the two modalities, and will finish by briefly outlining the discrete FHMM implementation used by Pan et al. for audio-visual speaker verification.

8.2.2 Maximising mutual information for audio-visual speech

In their original work on calculating the joint probability of audio-visual speech, Pan

et al. [131] showed that the optimal solution for the joint probability of a particular

sequence of coupled acoustic and visual observations, $O_a$ and $O_v$, can be calculated

according to the maximum entropy principle [80] as

p (Oa,Ov) = p (Oa) p (Ov)p (w,v)

p (w) p (v)(8.1)

where $w = g_a(O_a)$ and $v = g_v(O_v)$ are transformations designed such that $p(w,v)$ is easier to calculate than $p(O_a,O_v)$, but still reflects the statistical dependence between the two streams. The final term in (8.1) can therefore be viewed as a correlation weighting, which will be high if $w$ and $v$ (and therefore $O_a$ and $O_v$) are related, and low if they are mostly independent. In their work, Pan et al. [131] also showed that the minimum distance between the estimate $\hat{p}(O_a,O_v)$ and the ground truth $p(O_a,O_v)$ is established when the mutual information between $w$ and $v$ is maximised:

when the mutual information betweenw and v is maximised:

(w, v) = arg max(w,v)∈θ

I (w,v) (8.2)

In their audio-visual FHMM paper [130], Pan et al. chose w and v empirically from


the following set (Θ):

$w = \hat{U}_a, \quad v = O_v$ (8.3)

$w = \hat{U}_a, \quad v = \hat{U}_v$ (8.4)

$w = O_a, \quad v = \hat{U}_v$ (8.5)

where $\hat{U}_x$ is an estimate of the optimal state sequence of HMM $x$ for output $O_x$. By applying (8.2) over the set $\Theta$ and invoking the following inequality from information theory

$I(x, f(y)) \le I(x, y)$ (8.6)

and considering that an estimated hidden state sequence can be viewed as a function of the output ($\hat{U}_x = f_x(O_x)$), Pan et al. [130] concluded that

$I(\hat{U}_a, \hat{U}_v) = I(\hat{U}_a, f_v(O_v)) \le I(\hat{U}_a, O_v)$ (8.7)

$I(\hat{U}_a, \hat{U}_v) = I(f_a(O_a), \hat{U}_v) \le I(O_a, \hat{U}_v)$ (8.8)

Therefore the transforms (8.3) and (8.5) can produce better estimates of $p(O_a,O_v)$ than (8.4). More generally, this indicates that it is better to fuse two HMMs together through a combination of the states of one with the observations of the other, rather than linking the hidden states of the two HMMs.

By invoking (8.3) in (8.1):

$p_a(O_a, O_v) = p(O_a)\,p(O_v)\,\dfrac{p(\hat{U}_a, O_v)}{p(\hat{U}_a)\,p(O_v)} = p(O_a)\,p(O_v \mid \hat{U}_a)$ (8.9)


where $p(O_a)$ can be estimated from a regular audio HMM and $p(O_v \mid \hat{U}_a)$ is the likelihood of the video observation sequence given the estimated audio HMM state sequence that produced $O_a$. This equation represents the audio-biased FHMM, as the main decoding process comes from the audio HMM.

Similarly, invoking (8.5) to arrive at the video-biased FHMM gives:

$p_v(O_a, O_v) = p(O_v)\,p(O_a \mid \hat{U}_v)$ (8.10)

The choice between the audio- and video-biased FHMM should be based upon which individual HMM can more reliably estimate the hidden state sequence for a particular application. Alternatively, both versions can be used concurrently and combined using output fusion, as in the speaker verification experiments of Pan et al.'s original implementation [130].
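As an illustration of how these biased scores could be evaluated in practice, the following log-domain sketch assumes hypothetical callables standing in for the dominant uni-modal HMM (returning its Viterbi score and best state path) and for the coupling term conditioned on that path; it is not Pan et al.'s implementation.

def audio_biased_fhmm_score(log_p_audio_hmm, log_p_video_given_audio_states,
                            O_a, O_v):
    """log p_a(O_a, O_v) = log p(O_a) + log p(O_v | U_hat_a), as in (8.9)."""
    log_p_Oa, U_hat_a = log_p_audio_hmm(O_a)          # Viterbi score and best state path
    return log_p_Oa + log_p_video_given_audio_states(O_v, U_hat_a)

def video_biased_fhmm_score(log_p_video_hmm, log_p_audio_given_video_states,
                            O_a, O_v):
    """log p_v(O_a, O_v) = log p(O_v) + log p(O_a | U_hat_v), as in (8.10)."""
    log_p_Ov, U_hat_v = log_p_video_hmm(O_v)
    return log_p_Ov + log_p_audio_given_video_states(O_a, U_hat_v)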

8.2.3 Discrete implementation

The process of training the acoustically and visually biased FHMM structures is

given by Pan et al. as a three step process [130]:

1. Two individual HMMs are trained independently by the EM algorithm.

2. The best hidden state sequences of the HMMs are found using the Viterbi algo-

rithm.

3. The coupling parameters are determined.

Here the coupling parameters are represented by the final conditional probability terms in (8.9) and (8.10); put simply, they describe the likelihood of a particular subordinate observation occurring for a particular dominant HMM state. These distributions can be estimated simply by aligning the hidden state sequences of the dominant


HMM with the subordinate observations and forming some model of which subordi-

nate observations are likely to coincide with each of the dominant HMM states.

In Pan et al.'s implementation of their FHMM structure, a vector-quantisation codebook was used to represent the subordinate modality alongside the continuous HMM trained on the dominant modality. Pan et al.'s acoustically biased FHMM combined a linear prediction coding (LPC) cepstral coefficient-based HMM with a 16-word codebook based on the raw gray-level pixel values of the ROI, while the visually biased FHMM combined a raw gray-level pixel visual HMM with a 64-word codebook based on the acoustic LPC features.
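A simple sketch of estimating such discrete coupling parameters is given below, under the assumption that the subordinate observations have already been vector-quantised to codeword indices and that the dominant HMM's best state path is available; the array names and flooring constant are illustrative only.

import numpy as np

def estimate_coupling(dominant_states: np.ndarray,   # (T,) best state path, values 0..S-1
                      codewords: np.ndarray,         # (T,) VQ codeword index per frame, 0..K-1
                      num_states: int, codebook_size: int,
                      floor: float = 1e-3) -> np.ndarray:
    """Smoothed per-state codeword histograms approximating p(v | u)."""
    counts = np.full((num_states, codebook_size), floor)
    for u, v in zip(dominant_states, codewords):
        counts[u, v] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # each row is p(v | u)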

In their speaker verification experiments on a small in-house database, Pan et al.

found that output score fusion of both the acoustically and visually biased FHMM

structures outperformed coupled HMMs, but was still improved upon by a product HMM. Pan et al. concluded that although the product HMM did perform better than their FHMM structure, the simplicity of their FHMM structure gave it a real advantage in real-world situations [130].

8.3 Fused HMM adaptation of synchronous HMMs†

8.3.1 Continuous FHMMs

Pan et al’s implementation of their FHMMstructure, using a discrete vector-quantisation

codebook for the subordinate modalities, worked well for speaker verification in their

original paper [130]. Similar results were also replicated in a discrete implementation

of the FHMM structure by Dean et al. [44] on the single session CUAVE [133] database.

However, the use of a vector quantisation codebook to represent the widely varying nature of the acoustic and visual speech information does not generalise well over multiple recording sessions, as the large variability in the acoustic and visual information between different sessions cannot easily be represented within the limited confines of discrete vector quantisation codebook tables.

Figure 8.1: (a) Discrete (acoustic-biased) FHMM and (b) synchronous HMM. By replacing the discrete secondary representations with continuous representations in Pan et al.'s [130] original FHMM, it can be seen that a SHMM will be created.

By extending the original FHMM design to represent the subordinate modality using continuous GMMs, a novel extension of Pan et al.'s original FHMM structure was developed that is more robust to inter-session variability than the original discrete

implementation. However, rather than this continuous FHMM structure being a new

multi-stream HMM design, it can be seen that it is in fact equivalent to the SHMM

model introduced in the previous chapter, as shown in Figure 8.1.

Therefore, rather than being seen as a novel model for representing audio-visual speech, the continuous FHMM training method can be seen as a novel method of training a SHMM based on the state sequences of a single uni-modal acoustic or visual HMM.

8.3.2 Fused-HMM adaptation

The FHMM adaptation process can be considered identical to the original training

process, but the estimation of the subordinate coupling parameters is performed using

EM training of GMMs rather than through the training of a discrete codebook as in the

original implementation. Therefore the FHMM-adaptation of an acoustically-biased SHMM from an already trained uni-modal acoustic HMM can be simply defined as the following two-step process (a brief illustrative sketch is given after the list):


1. Determine the best hidden state sequence of the acoustic HMM over the training

data.

2. For each state of the acoustic HMM:

(a) Train a visual GMM based upon the visual observations that coincide with

the acoustic state in the training data

(b) Append the visual GMM to the already existing acoustic HMM to produce

a SHMM
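The sketch below illustrates this two-step process under the assumptions that a forced_align function returning per-frame acoustic state labels is available and that the audio and video feature streams are frame-synchronous; the function names and the use of scikit-learn GMMs are illustrative rather than the actual thesis toolchain.

import numpy as np
from sklearn.mixture import GaussianMixture

def fhmm_adapt(acoustic_hmm, forced_align, audio_feats, video_feats,
               num_states: int, video_mixtures: int = 16):
    """Build per-state visual GMMs from the acoustic HMM's state alignment."""
    state_path = forced_align(acoustic_hmm, audio_feats)   # (T,) acoustic state index per frame
    video_gmms = []
    for state in range(num_states):
        frames = video_feats[state_path == state]           # visual frames aligned to this state
        # Assumes enough frames per state to estimate the chosen number of mixtures.
        gmm = GaussianMixture(n_components=video_mixtures,
                              covariance_type='diag').fit(frames)
        video_gmms.append(gmm)
    # Appending these video GMMs to the corresponding acoustic states yields the SHMM.
    return video_gmms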

For the sake of simplicity of explanation, throughout the remainder of this section, the

FHMM-adaptation process will be outlined using the example of FHMM-adaptation

of an acoustically biased SHMM from an already existing uni-modal acoustic HMM.

The equivalent process for FHMM-adaptation of a visually-biased SHMM, if needed,

can be easily derived by swapping the roles of the two modalities within the FHMM-

adaptation process.

The FHMM-adaptation process allows the state representations of the two modalities to be estimated separately, although the visual modality does of course depend upon the acoustic. In this way, it can be seen to be similar to the separate estimation method of producing a SHMM by combining two uni-modal HMMs that was briefly touched on in comparison to the joint-training method of SHMM parameter estimation. However, in the case of FHMM-adaptation there is no concern about the states not being aligned,

because the state alignment of the video models is dictated by the acoustic HMM.

Additionally, training a SHMM using FHMM-adaptation is a quicker process than jointly training a SHMM, as the Baum-Welch re-estimation process only has to occur for the acoustic observations; the state sequences are already fixed by the time the visual state models are estimated.

Of course, the FHMM-adaptation method does require that a sufficiently good esti-

mate of the state sequences can be obtained from the acoustic data alone in training,

but as the SHMM-training stream weighting experiments showed in Chapter 7,


the choice of stream weights during training was of little importance to the final

speech performance, and therefore the choice of audio-only (αtrain = 1.0) dictated by

the FHMM-adaptation process should have no detrimental impact on the final perfor-

mance.

Background SHMM models

The FHMM-adaptation of the background speech SHMMs was performed over the

same training partitions of the XM2VTS database as the jointly-trained SHMMs in

accordance with the speech processing framework developed in Chapter 4.

However, instead of taking the training observations and transcriptions to jointly train a background SHMM for each word in the XM2VTS database, the FHMM-adaptation process starts with the uni-modal acoustic background HMMs and generates a time-aligned transcription of the words and states of the acoustic HMMs over the entire training sequence. This time-aligned transcription can then be used to segment all the video observations that coincide with each state of the original uni-modal HMM and to train a state-model video GMM for each of the original acoustic states. By appending the resulting video GMMs to the acoustic GMMs already existing within the states of the uni-modal HMM, a new audio-visual SHMM is generated.

Speaker-dependent SHMM models

Adapting the FHMM-adapted background models to specific speakers for the purposes of speaker-dependent speech recognition and speaker verification is a simple process. The already speaker-dependent acoustic HMMs trained in Chapter 5 are used as the time-alignment basis to MAP-adapt a speaker-dependent video GMM for each word and state from the existing FHMM-adapted background video GMMs trained previously. By appending these speaker-dependent video GMMs to the speaker-dependent acoustic HMMs, a final set of speaker-dependent SHMMs can


easily be trained for each client speaker in the speech processing framework.
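For concreteness, a standard relevance-factor MAP update for the mean of mixture component i of a video state GMM is sketched below; the exact adaptation configuration (which parameters are adapted, and the value of the relevance factor r) follows the speaker-adaptation setup of Chapter 5 rather than this generic form, and λi is used for the adaptation coefficient to avoid confusion with the stream weight α.

% Generic mean-only MAP update for mixture component i of a video state GMM,
% given client frames x_t and occupancies gamma_i(t) from the background
% (prior) model; r is a relevance factor.
\begin{align}
  n_i &= \sum_t \gamma_i(t), &
  E_i(x) &= \frac{1}{n_i} \sum_t \gamma_i(t)\, x_t, \\
  \hat{\mu}_i &= \lambda_i\, E_i(x) + (1 - \lambda_i)\, \mu_i, &
  \lambda_i &= \frac{n_i}{n_i + r}.
\end{align}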

Decoding

As the SHMM trained using the FHMM-adaptation method is a normal SHMM, the decoding of the model for speech or speaker recognition is conducted identically to that of a jointly-trained SHMM, and the resulting speech models can be normalised and weighted in the same manner as their jointly-trained counterparts.
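As a reminder of what is being computed during decoding, the weighted state emission score of a SHMM takes the usual synchronous multi-stream form shown below (written generically; this is consistent with the αtest weighting used in this and the previous chapter, with α = 1 corresponding to audio-only and α = 0 to video-only decoding).

% Weighted synchronous emission score for SHMM state j at time t, with acoustic
% and visual observations o^A_t and o^V_t and stream weight alpha in [0, 1].
\begin{equation}
  \log b_j(o_t) = \alpha \log b_j^A\!\left(o_t^A\right)
                + (1 - \alpha) \log b_j^V\!\left(o_t^V\right)
\end{equation}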

8.4 Biasing of FHMM-adapted SHMMs†

8.4.1 Introduction

Because both background and speaker-dependent SHMM models can be trained using FHMM-adaptation, unlike joint-training of SHMMs using the HMM Toolkit [194], speaker-independent and speaker-dependent speech recognition experiments, as well as speaker verification experiments, can be performed using the FHMM-adapted SHMMs.

Both the background speech models and the speaker-adaptation of those models are produced according to the speech processing framework developed in Chapter 4, with the underlying acoustic HMMs being identical to those used for the speech processing experiments in Chapter 5. These acoustic HMMs are FHMM-adapted through the addition of a 16-mixture GMM for each acoustic state, resulting in an 11-state SHMM with 8 mixtures for the acoustic and 16 for the visual features, which is identical to the jointly-trained SHMMs demonstrated in the previous chapter.

As well as the acoustically biased FHMM-adapted SHMM, which has been the focus of the FHMM-adaptation process detailed so far, visually-biased SHMM background and speaker-adapted models were also created by appending an 8-mixture acoustic GMM to each state of the original visual HMMs. Both the acoustically and visually biased SHMMs were


identical in topology, with the only difference being the order of the two modalities in the concatenative fusion input vector. However, because the visually-biased SHMM was based on the underlying visual HMM, the acoustic features were down-sampled to the video frame rate before concatenation of the two feature vectors.
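As a small illustration of this frame-rate alignment step (the rates shown are assumptions for illustration, e.g. 100 Hz acoustic features against 25 Hz video features, and simple decimation is used; the actual rates and method follow the feature extraction described in earlier chapters):

import numpy as np

def downsample_to_video_rate(audio_feats, audio_rate=100, video_rate=25):
    # audio_feats: (num_audio_frames x dims) array of acoustic features.
    # Keep every (audio_rate // video_rate)-th frame, e.g. every 4th frame.
    step = audio_rate // video_rate
    return audio_feats[::step]

def concatenative_fusion_vector(video_feats, audio_feats):
    # Build the per-frame concatenated observation for a visually-biased model,
    # with the video features leading the (down-sampled) acoustic features.
    n = min(len(video_feats), len(audio_feats))   # guard against length mismatch
    return np.hstack([video_feats[:n], audio_feats[:n]])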

8.4.2 Acoustically or visually biased

In the original paper on their discrete FHMM structure, Pan et al. suggested that the best approach is either to choose the dominant modality as the one with the better ability to discriminate speech, or to combine the two designs through output score fusion of both FHMMs [130]. Because continuous speech recognition cannot easily be modelled through output score fusion of separate classifiers, a choice has to be made for these experiments between the acoustically and visually biased FHMM-adaptation processes.

Taking Pan et al.’s suggestion of choosing the best modality based on its ability to discriminate speech would generally lead to the acoustically-biased option, as the speech recognition ability of audio is much higher than that of video, as was clearly demonstrated in the uni-modal HMM speech recognition experiments in Chapter 5. However, as FHMM-adaptation is a training procedure, the ability of each modality to discriminate the boundaries of speech events in training should be more important than the ability to decode unknown speech. Furthermore, as the training stream weight experiments in Chapter 7 have shown, there is little difference in the ability of the acoustic or visual features to discriminate state boundaries during training, making the choice between acoustically and visually biased FHMM-adaptation less straightforward.

To come to a decision on the dominant modality for the FHMM-adaptation experiments, both the acoustically and visually biased SHMMs trained using the FHMM-adaptation technique were compared for speaker independent speech recognition as the testing stream weighting parameter αtest was varied on the first configuration of the


XM2VTS database. The results of these experiments are shown in Figure 8.2.

Figure 8.2: Performance of acoustically and visually biased FHMM-adapted SHMMs as testing stream weights are varied; (a) acoustically biased, (b) visually biased. [Plots of Word Error Rate against αtest for 0, 6, 12 and 18 dB SNR, and the average over all noise levels.]

8.4.3 Discussion

Looking at the stream-weighting curves in Figure 8.2, the visually biased FHMM-adapted SHMMs appear to provide better performance at the extreme points (αtest = 0 or αtest = 1). However, the acoustically biased versions generally provide similar or better performance at the best-performing point of the curves. Additionally, for the acoustically biased models the best-performing stream weights for each noise condition have less of a spread around the average best-performing stream weight of αtest = 0.5, allowing for better unsupervised speech decoding using the audio-biased SHMMs trained using the FHMM-adaptation method.

While the performance difference between the acoustically and visually biased FHMM-adaptation methods is not large, the acoustically biased version was chosen for the remainder of this thesis, primarily because of its improved speech recognition ability in noisy acoustic conditions.


Figure 8.3: Speaker independent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database. [Plot of Word Error Rate against signal-to-noise ratio (dB) for the A-PLP, V-LDA-MRDCT, Discriminative FF, Jointly-trained SHMM and FHMM-adapted SHMM systems.]

8.5 Speech recognition experiments†

8.5.1 Results

The unsupervised speech recognition experiments using the acoustically-biased FHMM-adaptation process were conducted over all 12 configurations of the XM2VTS database according to the speech processing framework developed in Chapter 4. Speaker independent experiments were performed using the background SHMMs, and speaker dependent experiments using the SHMMs adapted to each client speaker under test. Based on the stream weighting tuning experiments performed in Section 8.4, the unsupervised stream weighting of αtest = 0.5 was chosen, as it provided the best average performance over all noise levels amongst all the possible stream weights.

Figure 8.4: Speaker dependent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database. [Plot of Word Error Rate against signal-to-noise ratio (dB) for the A-PLP, V-LDA-MRDCT, Discriminative FF and FHMM-adapted SHMM systems.]

The results of the speaker independent speech recognition experiments using the FHMM-adaptation process are shown in Figure 8.3. Both the jointly-trained SHMM and discriminative feature-fusion experiments are also included for comparison, in addition

to the uni-modal acoustic and visual HMM performances.

The results of the speaker dependent speech recognition experiments using the FHMM-adaptation process are also shown in Figure 8.4. While jointly-trained SHMMs could not be included, both the discriminative feature-fusion and uni-modal audio and video HMMs are shown for comparison.

8.5.2 Discussion

From an examination of the FHMM-adapted versus jointly-trained SHMMs for speaker independent speech recognition in Figure 8.3, it can be seen that the FHMM-adaptation method provides a clear improvement over the jointly-trained SHMMs at


all levels of acoustic noise tested in these experiments. This improvement begins at around 1 point of WER in clean conditions, but rises to around 7 points at 0 dB SNR.

Figure 8.5: Comparing the A-PLP biased FHMM-adapted SHMM with an equivalent jointly-trained SHMM on the first configuration of the XM2VTS database; (a) FHMM-adapted (audio-biased), (b) jointly trained. [Plots of Word Error Rate against αtest for 0, 6, 12 and 18 dB SNR, and the average over all noise levels.]

To further examine the reasons for the improvement of the FHMM-adaptation method

over joint-training, the effect of varying the testing stream weights for each of these

SHMM training methods can be examined to determine if any conclusions can be

drawn as to the effectiveness of the two methods.

Such a comparison between acoustically-biased FHMM-adaptation and joint-training

is shown in Figure 8.5. These test-weighting results were taken from the tuning experi-

ments for determining the best-performing stream weights in Chapter 7 and Section 8.4,

and therefore were only conducted on the first configuration of the XM2VTS database

under the speech processing framework.

The main difference between the two SHMM training methods evident in Figure 8.5

is the clearly improved performance of the FHMM-adapted SHMM in the 0 and 6 dB SNR conditions,

while the performance increase of around 0.7 points in the cleaner conditions is barely

perceptible at the scale of the graph. It appears that even though the jointly-trained


SHMM produces a better video speech recognition performance when αtest = 0, the

FHMM-adaptation method of training the video state models produces video models

that interact better with the pre-existing acoustic models than those produced through the joint-training method of SHMM parameter estimation.

However, the improvement in the video performance of the FHMM-adapted SHMM

is not due to the audio-only state boundaries estimated during training being better than those estimated from video, or from any combination of the two. In fact, as the experiments with

jointly trained SHMMs in Chapter 7 have shown, the choice of αtrain, in this particular

case, has no real effect on the final speech recognition performance. As the FHMM-

adapted SHMM is derived from an audio-only HMM, the FHMM-adapted SHMM can

be considered to be training the state models on the same estimated state alignments

as a jointly trained SHMM with αtrain = 1.0. However, if our FHMM-adapted system

is compared with such a jointly-trained SHMM, the FHMM-adapted performance is

still much improved.

The main reason for the improvement of the FHMM-adapted SHMM video models

appears to be related to poor initialisation of the Baum-Welch training algorithm for

video HMMs. Before the training algorithm can begin, a bootstrapped estimate of the

HMM parameters must first be provided. Typically, and in our particular case, these

estimates are based on a uniform segmentation of the training data [194], basically

dividing each group of training observations for a particular word-model into equal

segments for each underlying state. While this segmentation is obviously unlikely to

correspond to the final state alignments from a well trained HMM, it has been shown

to provide a good initial point to start the Baum-Welch estimation process for audio

speech processing tasks [194].
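To make the difference in initialisation concrete, the sketch below contrasts a uniform segmentation of an utterance (the usual bootstrapping assumption) with the acoustically forced alignment used by FHMM-adaptation; it is illustrative only, with made-up utterance lengths, and is not the HTK initialisation routine itself.

import numpy as np

def uniform_segmentation(num_frames, num_states):
    # Assign each frame to a state by dividing the utterance into equal-length
    # segments, one per state (the usual Baum-Welch bootstrapping assumption).
    return np.minimum(np.arange(num_frames) * num_states // num_frames,
                      num_states - 1)

# Example: a 23-frame utterance and a 3-state word model.
video_feats = np.random.randn(23, 40)                   # placeholder video features
flat_align = uniform_segmentation(len(video_feats), 3)  # 0,...,0,1,...,1,2,...,2
# With FHMM-adaptation the per-frame alignment would instead come from the
# trained acoustic HMM, e.g. forced_align = acoustic_hmm.predict(audio_feats),
# so the visual GMMs are trained on acoustically determined state boundaries.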

However, from the results in this thesis, it would appear that this assumption does not necessarily hold for video speech processing tasks. While the video models for

the baseline jointly-trained SHMM were initialised and trained alongside the audio

models, the FHMM-adapted video models were trained directly on previously deter-

mined state alignments. While the final state-alignment arrived at by the Baum-Welch


algorithm of the jointly-trained SHMM trained at αtrain = 1.0 would match the state alignment used for FHMM-adaptation, there is nonetheless a clear difference in the re-

sulting speech recognition performance. It would appear that the initialisation of the

video models using the uniform segmentation has a detrimental effect on the final

video models. By estimating the video parameters directly on the known-good audio alignments, this detrimental effect can be limited, increasing the overall speech recognition performance of the SHMM accordingly.

Speaker dependent FHMM-adaptation

The speaker dependent speech recognition results presented in Figure 8.4 unfortunately could not be presented alongside speaker-dependent jointly-trained SHMM models, due to limitations of the HMM Toolkit used for performing the joint-training. However, similar relative performance is obtained in comparison to the uni-modal and discriminative feature-fusion systems for speaker dependent speech recognition.

In particular, the use of speaker-dependent SHMMs trained using the FHMM-adaptation method allows the speech recognition WER to remain below the video-only error rate for the entire range of acoustic conditions under test. The FHMM-adapted systems can also outperform the already well-performing uni-modal acoustic HMMs in clean conditions through the addition of video state models based on the acoustic boundaries of the original HMMs.

8.6 Speaker verification experiments†

8.6.1 Introduction

As the FHMM-adaptation process of training SHMMs has allowed for speaker-dependent speech models to be trained, such models can also be used for text-dependent speaker


verification under the speech processing framework developed in Chapter 4. In this section, the text-dependent FHMM-adapted SHMMs will be used to attempt to verify speakers according to the speech processing framework, in comparison to the uni-modal HMMs, discriminative feature fusion and output fusion for the same task. Results will be reported based on the EERs over the full range of acoustic noise conditions under test.

Figure 8.6: Tuning the testing stream weight parameter αtest for speaker verification using FHMM-adapted SHMMs. [Plot of Equal Error Rate (%) against αtest for 0, 6, 12 and 18 dB SNR, and the average over all noise levels.]

8.6.2 Stream weighting

In order to conduct the speaker verification experiments using the FHMM-adapted

SHMMs, the response of this structure to the stream weighting parameter αtest un-

der various noisy conditions must first be evaluated to allow a suitable choice of the final stream weights for the unsupervised speaker verification experiments.


In order to choose this value, the speaker verification experiments using the FHMM-

adapted SHMMs were performed over a range of stream weights on the first partition

of the XM2VTS database, the results of which are shown in Figure 8.6.

In performing these tuning experiments to determine the best stream weighting pa-

rameter, one of the major disadvantages of SHMM-based speaker verification becomes

apparent. As the stream weights are modified within the states of the HMM as each

utterance is verified, rather than as a final stage of fusing classifier scores, evaluating a

different weighting parameter requires that the entire HMM decoding process be run

through each time. In comparison, changing the weighting parameter in output score

fusion only requires the recalculation of a single mathematical equation.
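For comparison, the late-integration score used in the output score fusion baseline is a simple weighted sum of the per-utterance acoustic and visual scores, of the generic form below, so sweeping the weight only requires re-evaluating this one line (the exact score normalisation follows the output fusion configuration of Chapter 6):

% Generic weighted output-score fusion of per-utterance acoustic (s_A) and
% visual (s_V) verification scores with stream weight alpha_test.
\begin{equation}
  s_{\mathrm{fused}} = \alpha_{\mathrm{test}}\, s_A + (1 - \alpha_{\mathrm{test}})\, s_V
\end{equation}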

Because the evaluation of a full range of αtest values therefore requires a very large

number of HMM-based verifications, the tuning of the stream weighting parameters

shown in Figure 8.6 is only performed on a single partition of the XM2VTS database,

rather than on all 12 configurations available under the speech processing framework.

While this approach is clearly less demanding than using all configurations, the resulting per-

formance curve as αtest is varied is not as finely differentiated as it would have been

if more verification experiments could have been performed for each of the stream

weighting parameters tested here.

However, these curves should be adequate to broadly show the relative speaker verifi-

cation performance as αtest is varied, and the average performance over all noise levels

was used to choose an αtest of 0.2 for the FHMM-adapted SHMMs in the unsupervised

speaker verification tests. This choice of αtest also had the advantage that it was iden-

tical to that used for the late integration experiments, allowing for the two integration

strategies to be compared easily.


Figure 8.7: Text-dependent speaker recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database. [Plot of Equal Error Rate (%) against signal-to-noise ratio (dB) for the A-PLP, V-LDA-MRDCT, Discriminative FF, Output Fusion and FHMM-adapted SHMM systems.]

8.6.3 Results

The results of the unsupervised FHMM-adapted speaker verification experiments are

shown in comparison to the equivalent text-dependent discriminative feature fusion

and output score fusion as well as the uni-modal HMMs in Figure 8.7. While it can be

seen that the FHMM-adapted SHMMs perform well in comparison to early integration, they nonetheless perform catastrophically in noisy conditions, and are easily bested by the output score fusion of the uni-modal HMMs for all of the acoustic conditions

under test.


8.6.4 Discussion

While it is certainly possible that an adaptive fusion approach could be used to allow the stream weighting parameters to be varied based on an estimate of the prevailing environmental conditions, a similar approach could also be applied to the output score fusion of the uni-modal HMMs.

Indeed, in order for SHMM approaches to speaker verification to improve over simple

output score fusion of uni-modal HMMs, the SHMM approach would have to show

that it can take advantage of some temporal dependency between the speech features.

Because output-score fusion can only occur at the end of an utterance, or through an

externally determined segmentation process, it would not be able to take advantage of such dependencies on a frame-by-frame basis.

However, at least for text-dependent speaker verification, such a situation is unlikely

to occur, as the main role of the HMM structure is to align the GMMs within the HMM

against the pertinent speech events. As the stream weighting experiments in the joint-

training of SHMMs in Chapter 7 have shown, both the acoustic and visual modalities

are equally good at determining the hidden state boundaries of a known transcription,

and therefore can align state-models equally well when evaluating a known phrase for

text-dependent speaker verification.

A possible avenue of future research that may be able to take advantage of the SHMM structure for speaker verification may focus on using the SHMM for limited-vocabulary text-independent speaker verification. By exploiting the SHMM’s ability to find the correct transcription through the network, as exhibited in the speech recognition experiments in this and the previous chapter, the SHMM may be able to perform well for speaker verification. While this may provide better performance than two uni-modal HMMs in a similar configuration, it is not clear whether this approach would be better than considering the output of two large-vocabulary text-independent speaker verification models, one in each modality.


8.7 Chapter summary

In this chapter, the FHMM-adaptation method of training a SHMM through the training of secondary state models for an already existing uni-modal HMM was introduced. By using this

approach to add visual state models to already existing uni-modal acoustic HMMs,

the resulting SHMM provided improved speaker independent and dependent speech

recognition ability in all noise conditions under test, particularly so in the 0 dB SNR

conditions.

Because this approach to SHMM training could also be used to train speaker depen-

dent models, which was not possible for jointly trained SHMMs using the HMM

Toolkit [194], speaker verification experiments could also be conducted in compari-

son to the output score and feature fusion speaker verification experiments performed

in earlier chapters. However, the added complexity of the SHMM approach was not found to improve upon output fusion of uni-modal HMMs.


Chapter 9

Conclusions and Future Work

This chapter concludes the thesis by summarising the conclusions made in each chap-

ter and noting the original contributions made. It also provides suggestions for fur-

ther work that could extend the research reported in this thesis or that was beyond the

scope of the work reported here.

9.1 Conclusions

The central aim of the work presented in this thesis has been the investigation of

speech and speaker recognition using both acoustic and visual speech features, with

a particular focus on the SHMM-based methods of fusing the two separate modalities

to enable automatic speech processing systems to be more robust to acoustic noise

than conventional acoustic systems. Accordingly, the work performed in this thesis

focused on four main areas:

1. To investigate the suitability of existing feature extraction and integration tech-

niques for both speech and speaker recognition.

2. To study and develop techniques to improve the audio-visual speech modelling


ability of SHMMs trained using the state-of-the-art joint-training process.

3. To develop an alternative training technique for SHMMs that can improve the

audio-visual speech modelling ability in comparison to the existing state-of-the-

art joint-training process.

4. To compare and contrast the suitability of SHMMs for speech and speaker recog-

nition in comparison to existing baseline integration techniques.

The work conducted in these four main areas has resulted in a variety of speech pro-

cessing systems and experiments conducted throughout this thesis to investigate the

suitability of each type of system to the two tasks of speech and speaker recognition.

While a number of novel contributions were presented in early chapters, as will be

outlined below, the major novel contribution of this thesis program is presented in

Chapter 8, where the FHMM-adaptation method of training SHMMs is introduced for

speech and speaker recognition. These FHMM-adapted SHMMs have demonstrated

improved speech modelling ability over jointly-trained SHMMs, as was particular

demonstrated by the improved speech recognition performance over jointly-trained

SHMMs at all levels of acoustic noise. However, for speaker verification, the improved

temporal coupling provided by the SHMMmodel did not appear to provide a signifi-

cant improvement over the late integration approach.

The major original contributions resulting from this work are summarised as follows:

1. A novel framework was presented in Chapter 4 for the evaluation of both speech and speaker recognition on the XM2VTS database whilst reusing the same speech models. This framework was used throughout the thesis, and allowed for easy comparison between differing features and fusion techniques for both unimodal

and multimodal AVSP applications.

2. Dynamic CAB features, known to work well for visual speech recognition, were tested for visual speaker verification in Chapter 5. These novel


experiments showed that the discrimination ability between speakers also im-

proved as static information was removed from the visual speech features through

the application of the stages of the feature extraction cascade. These results sug-

gested that the visual recognition of speakers can be improved by treating visual

speech as a behavioural rather than physiological characteristic of a speaker.

3. The effect of varying the stream weights independently during training and test-

ing of SHMMs was investigated in Chapter 7. Previous experiments in the lit-

erature on SHMM training have only dealt with the stream weights as a single

value that was the same for both the training and testing process of SHMMs.

These novel experiments showed that while varying the testing stream weights

had a large impact on the final speech recognition performance, similar changes

during the training process had a negligible impact on the final performance.

These experiments demonstrated that, as SHMM training is a more constrained

task than SHMM decoding during testing, either audio or video features (or any

fusion of the two) can segment the training utterances equally well. This re-

sult was particularly interesting in comparison to varying the stream weights

during testing, where the fusion ratio of the audio and video streams was of

paramount importance to the final speech recognition performance.

4. Chapter 7 also introduced two novel techniques for normalising the audio and video streams within the SHMM during decoding. The first technique

introduced was a novel adaptation of zero normalisation that normalised the

mean and variance of the video scores to be within a similar range to the acous-

tic scores, but required access to the Viterbi decoder process for implementation.

The second technique performed variance-only normalisation solely through the

adjustment of the stream weights, which allowed normalisation to occur with an

unmodified standard Viterbi decoder. Both normalisation techniques performed

similarly for audio-visual speech processing and were found to improve the ro-

bustness to acoustic noise over un-normalised SHMMs.

5. Chapter 8 introduced FHMM-adaptation, a novel alternative training technique for SHMMs that provided improved audio-visual speech modelling ability when


compared to the existing state-of-the-art training techniques for SHMMs. Exper-

iments were conducted with the resulting FHMM-adapted SHMMs to compare

and contrast this alternative SHMM training technique against jointly-trained

SHMMs and earlier fusion methods for both speech and speaker recognition.

These experiments showed that FHMM-adaptation can improve the performance

over jointly-trained SHMMs for speech recognition over all noise levels, with a

particular improvement in noisy conditions. However, the additional compli-

cation of SHMMs for speaker verification did not appear to provide any benefit

over simple output-score fusion of uni-modal HMMs for the same task.

9.2 Future work

A number of different avenues of further work have been identified as a result of work

completed in this thesis. These can be summarised as follows:

1. In Chapter 4 a speech processing framework was developed for the XM2VTS [119]

database. This framework could fairly easily be extended to other databases to

allow for similar comparative speech processing experiments to be conducted

against other datasets. One such promising dataset may be the AVICAR [96]

database, allowing for audio-visual speech processing to be tested in an auto-

motive environment.

2. The speaker verification experiments in Chapter 5 showed that verification error consistently decreased as static information was removed from the visual speech features. However, as face recognition systems have shown, the static characteristics of faces clearly can be useful for the recognition of people. Whilst

it was outside the scope of this thesis, it would be interesting to make a com-

parison between static and dynamic features for whole-face recognition, paying

particular attention to the current state of the art in face recognition research.

3. While Chapter 5 demonstrated the extraction of dynamic visual speech features


through a discriminative feature extraction approach on the ROI, alternative

methods of dynamic feature extraction based more directly on movement in the

video, such as optical flow, would allow for an interesting comparison with the

features used in this thesis for visual speech processing.

4. While a single weighting parameter was found to work reasonably well over all

noise conditions for the SHMM-based speech recognition experiments in Chap-

ters 7 and 8, a similar approach did not appear to be viable for the output-

score fusion and SHMM-based speaker verification experiments in Chapters 6

and 8 respectively. Improved unsupervised performance could be obtained for

both speech and speaker recognition systems if the relative reliability of the two

modalities could be estimated and corresponding stream weights adjusted in

real time through an adaptive fusion process.

5. This thesis only investigated degradation in the acoustic domain. However,

it can be difficult to reliably simulate the types of visual degradation that are present in real-world conditions. Rather than simulating visual degradation, research may be better focused on collecting and us-

ing audio-visual speech data in real-world conditions such as the AVICAR [96]

or BANCA [6] databases.


Bibliography

[1] A. Adjoudani, T. Guiard-Marigny, B. L. Goff, L. Reveret, and C. Benoit, “A mul-

timedia platform for audio-visual speech processing,” in 5th European Conference

on Speech Communication and Technology. Rhodes, Greece: Institut de la Com-

munication Parlee UPRESA, September 1997, pp. 1671–1674.

[2] K. Alsabti, S. Ranka, and V. Singh, “An efficient k-means clustering algorithm,”

in IPPS/SPDP Workshop on High Performance Data Mining, 1998.

[3] T. Artieres and P. Gallinari, “Stroke level HMMs for on-line handwriting recog-

nition,” in Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth Interna-

tional Workshop on, 2002, pp. 227–232.

[4] B. Atal, “Effectiveness of linear prediction characteristics of the speech wave for

automatic speaker identification and verification,” The Journal of the Acoustical

Society of America, vol. 55, p. 1304, 1974.

[5] R. Auckenthaler, J. Brand, J. Mason, F. Deravi, and C. Chibelushi, “Lip sig-

natures for automatic person recognition,” in Audio- and Video-based Biometric

Person Authentication (AVBPA ’99), 2nd International Conference on, Washington,

D.C., 1999, pp. 142–47.

[6] E. Bailly-Bailliére, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz,

J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran, “The BANCA

database and evaluation protocol,” in Audio-and Video-Based Biometric Person Au-

thentication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes in


Computer Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003,

pp. 625–638.

[7] J. Barron, D. Fleet, S. Beauchemin, and T. Burkitt, “Performance of optical flow

techniques,” in Computer Vision and Pattern Recognition, 1992. Proceedings CVPR

’92., 1992 IEEE Computer Society Conference on, 1992, pp. 236–242.

[8] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. fisherfaces: recog-

nition using class specific linear projection,” Pattern Analysis and Machine Intelli-

gence, IEEE Transactions on, vol. 19, no. 7, pp. 711–720, 1997.

[9] P. Belin, R. Zatorre, P. Lafaille, P. Ahad, and B. Pike, “Voice-selective areas in

human auditory cortex,” Nature, vol. 403, no. 6767, pp. 309–312, 2000.

[10] R. Bellman, Adaptive control processes : a guided tour. Princeton, N. J.: Princeton

University Press, 1961.

[11] S. Bengio, “Multimodal authentication using asynchronous HMMs,” in Audio-

and Video-Based Biometric Person Authentication. 4th International Conference,

AVBPA 2003. Proceedings, J. Kittler and M. Nixon, Eds. Guildford, UK: Springer-

Verlag, 2003, pp. 770–777.

[12] S. Bengio, “Multimodal speech processing using asynchronous hidden Markov

models,” Information Fusion, vol. 5, no. 2, pp. 81–9, June 2004.

[13] J. Bilmes, “A gentle tutorial on the EM algorithm and its application to parame-

ter estimation for Gaussian mixture and hidden Markov models,” International

Computer Science Institute, Tech. Rep., 1997.

[14] J. Brand, J. Mason, and S. Colomb, “Visual speech: A physiological or be-

havioural biometric?” in Audio- and Video-based Biometric Person Authentication

(AVBPA 2001), 3rd International Conference on, Halmstad, Sweden, 2001, pp. 157–

168.

[15] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improving connected letter recog-

nition by lipreading,” in Acoustics, Speech, and Signal Processing, 1993. ICASSP-

93., 1993 IEEE International Conference on, vol. 1, 1993, pp. 557–560 vol.1.


[16] C. Bregler and Y. Konig, ““Eigenlips” for robust speech recognition,” in Acous-

tics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Con-

ference on, vol. ii, 1994, pp. II/669–II/672 vol.2.

[17] V. Bruce and A. Young, “Understanding face recognition.” Br J Psychol, vol. 77,

no. Pt 3, pp. 305–27, 1986.

[18] J. Campbell, J.P., “Speaker recognition: a tutorial,” Proceedings of the IEEE,

vol. 85, no. 9, pp. 1437–1462, 1997.

[19] H. Cetingul, E. Erzin, Y. Yemez, and A. Tekalp, “On optimal selection of lip-

motion features for speaker identification,” in Multimedia Signal Processing, 2004

IEEE 6th Workshop on, 2004, pp. 7–10.

[20] H. Cetingul, E. Erzin, Y. Yemez, and A. Tekalp, “Multimodal speaker/speech

recognition using lip motion, lip texture and audio,” Signal Processing, vol. In

Press, Corrected Proof, pp. –, 2006.

[21] H. Cetingul, Y. Yemez, E. Erzin, and A. Tekalp, “Discriminative lip-motion fea-

tures for biometric speaker identification,” in 2004 International Conference on

Image Processing (ICIP), vol. Vol. 3. Singapore: IEEE, 2004, p. 2023.

[22] D. Chandramohan and P. Silsbee, “A multiple deformable template approach

for visual speech recognition,” in Spoken Language, 1996. ICSLP 96. Proceedings.,

Fourth International Conference on, vol. 1, 1996, pp. 50–53 vol.1.

[23] T. Chen, “Audiovisual speech processing,” Signal Processing Magazine, IEEE,

vol. 18, no. 1, pp. 9–21, 2001.

[24] T. Chen and R. Rao, “Audio-visual integration in multimodal communication,”

Proceedings of the IEEE, vol. 86, no. 5, pp. 837–852, 1998.

[25] C. Chibelushi, F. Deravi, and J. Mason, “A review of speech-based bimodal

recognition,” Multimedia, IEEE Transactions on, vol. 4, no. 1, pp. 23–37, 2002.

[26] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and R. Johnston, “Design issues

for a digital audio-visual integrated database,” in Integrated Audio-Visual Pro-


cessing for Recognition, Synthesis and Communication (Digest No: 1996/213), IEE

Colloquium on, 1996, pp. 7/1–7/7.

[27] C. Chibelushi, J. Mason, and F. Deravi, “Feature-level data fusion for bimodal

person recognition,” in Image Processing and Its Applications, 1997., Sixth Interna-

tional Conference on, vol. 1, 1997, pp. 399–403 vol.1.

[28] G. Chiou and J.-N. Hwang, “Lipreading from color video,” Image Processing,

IEEE Transactions on, vol. 6, no. 8, pp. 1192–1195, 1997.

[29] A. G. Chitu, L. J. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Comparison be-

tween different feature extraction techniques for audio-visual speech recogni-

tion,” Journal on Multimodal User Interfaces, vol. 1, no. 1, 2007.

[30] T. Cootes, G. Edwards, and C. Taylor, “Active appearance models,” Pattern Anal-

ysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 6, pp. 681–685,

2001.

[31] R. Corkrey and L. Parkinson, “Interactive voice response: Review of studies

1989-2000,” Behavior Research Methods, Instruments, & Computers, vol. 34, no. 3,

pp. 342–353(12), August 2002.

[32] B. Dasarathy, “Sensor fusion potential exploitation-innovative architectures and

illustrative applications,” Proceedings of the IEEE, vol. 85, no. 1, pp. 24–38, 1997.

[33] K. Davis, R. Biddulph, and S. Balashek, “Automatic recognition of spoken dig-

its,” The Journal of the Acoustical Society of America, vol. 24, p. 637, 1952.

[34] S. Davis and P. Mermelstein, “Comparison of parametric representations for

monosyllabic word recognition in continuously spoken sentences,” Acoustics,

Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], IEEE

Transactions on, vol. 28, no. 4, pp. 357–366, 1980.

[35] D. Dean, P. Lucey, and S. Sridharan, “Audio-visual speaker identification us-

ing the CUAVE database,” in Auditory-Visual Speech Processing (AVSP), British

Columbia, Canada, July 24-27 2005, pp. 97–101.


[36] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of syn-

chronous HMMs for audio-visual speech recognition,” Digital Signal Processing

(submitted).

[37] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Comparing audio and visual

information for speech processing,” in Eighth International Symposium on Signal

Processing and Its Applications (ISSPA), Sydney, Australia, 2005, pp. 58–61.

[38] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of multi-

stream HMMs for audio-visual speech recognition,” in Interspeech, Antwerp,

August 2007, pp. 666–669.

[39] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Weighting and normalisation

of synchronous HMMs for audio-visual speech recognition,” in Auditory-Visual

Speech Processing, Hilvarenbeek, The Netherlands, September 2007, pp. 110–115.

[40] D. Dean, S. Sridharan, and P. Lucey, “Cascading appearance based features for

visual speaker verification,” in Interspeech 2008 (accepted), 2008.

[41] D. Dean and S. Sridharan, “Dynamic visual features for audio-visual speaker

verification,” Computer Speech and Language (submitted).

[42] D. Dean and S. Sridharan, “Fused HMM adaptation of synchronous HMMs for

audio-visual speaker verification,” inAuditory-Visual Speech Processing (accepted),

2008.

[43] D. Dean, S. Sridharan, and T. Wark, “Audio-visual speaker verification using

continuous fused HMMs,” in HCSNet Workshop on the Use of Vision in HCI

(VisHCI), 2006.

[44] D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused

HMMs for speaker recognition,” in Second Workshop on Multimodal User Authen-

tication (MMUA), Toulouse, France, 2006.

[45] L. Debnath and S. G. Mallat, A wavelet tour of signal processing, 2nd ed. San

Diego: Academic Press, 1999.


[46] L. Debnath, S. G. Mallat, K. R. Rao, and P. Yip, Discrete cosine transform : algo-

rithms, advantages, applications, 2nd ed. Boston: Academic Press, 1990.

[47] P. Duchnowski, U. Meier, and A. Waibel, “See me, hear me: Integrating auto-

matic speech recognition and lip-reading,” in Proc. Int. Conf. Speech Lang. Pro-

cess., Yokohama, 1994.

[48] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-

Interscience, 2000.

[49] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” Multimedia, IEEE Transactions on, vol. 2, no. 3, pp. 141–151, 2000.

[50] H. Ellis, D. Jones, and N. Mosdell, “Intra-and inter-modal repetition priming of

familiar faces and voices,” Br J Psychol, vol. 88, no. Pt 1, pp. 143–56, 1997.

[51] N. Eveno, A. Caplier, and P.-Y. Coulon, “Accurate and quasi-automatic lip track-

ing,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, no. 5,

pp. 706–715, 2004.

[52] M.-I. Faraj and J. Bigun, “Audio-visual person authentication using lip-motion

from orientation maps,” Pattern Recognition Letters, vol. 28, no. 11, pp. 1368–1382,

Aug. 2007.

[53] N. Fox, R. Gross, J. Cohn, and R. Reilly, “Robust biometric person identification

using automatic classifier fusion of speech, mouth, and face experts,” Multime-

dia, IEEE Transactions on, vol. 9, no. 4, pp. 701–714, 2007.

[54] N. Fox and R. B. Reilly, “Audio-visual speaker identification based on the use

of dynamic audio and visual features,” in Audio-and Video-Based Biometric Person

Authentication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes

in Computer Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg,

2003, pp. 743–751.

[55] N. A. Fox, B. A. O’Mullane, and R. B. Reilly, “Audio-visual speaker identifica-

tion via adaptive fusion using reliability estimates of both modalities,” vol. 3546,

2005, pp. 787–796.


[56] H. Frowein, H. Frowein, G. Smoorenburg, L. Pyters, and D. Schinkel, “Improved

speech recognition through videotelephony: experiments with the hard of hear-

ing,” Selected Areas in Communications, IEEE Journal on, vol. 9, no. 4, pp. 611–616,

1991.

[57] T. Fu, X. X. Liu, L. H. Liang, X. Pi, and A. Nefian, “A audio-visual speaker

identification using coupled hidden Markov models,” in Image Processing, 2003.

Proceedings. 2003 International Conference on, vol. 3, 2003, pp. 29–32.

[58] T. Fukuda, M.-J. Jung, M. Najashima, F. Arai, and Y. Hasegawa, “Facial expres-

sive robotic head system for human-robot communication and its application in

home environment,” Proceedings of the IEEE, vol. 92, no. 11, pp. 1851–1865, 2004.

[59] K. Fukunaga, Introduction to statistical pattern recognition, 2nd ed. Boston: Aca-

demic Press, 1990.

[60] L. Girin, A. Allard, and J.-L. Schwartz, “Speech signals separation: a new ap-

proach exploiting the coherence of audio and visual speech,” in Multimedia Sig-

nal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 631–636.

[61] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin, “Weight-

ing schemes for audio-visual fusion in speech recognition,” in Proc.

Int. Conf. Acoust. Speech Signal Process., 2001. [Online]. Available:

citeseer.ist.psu.edu/glotin01weighting.html

[62] R. Goecke and J. Millar, “The audio-video Australian English speech data corpus

AVOZES,” in Proceedings of the 8th International Conference on Spoken Language

Processing ICSLP2004, vol. III, Jeju, Korea, Oct. 2004, pp. 2525–2528.

[63] R. Goecke, “A stereo vision lip tracking algorithm and subsequent statistical

analyses of the audio-video correlation in Australian English,” Ph.D. disserta-

tion, The Australian National University, Canberra, Australia, January 2004.

[64] B. Gold and N. Morgan, Speech and audio signal processing : processing and percep-

tion of speech and music. New York: Wiley, 2000.


[65] A. J. Goldschen, O. N. Garcia, and E. Petajan, “Continuous optical automatic speech recog-

nition by lipreading,” in Signals, Systems and Computers, 1994. 1994 Conference

Record of the Twenty-Eighth Asilomar Conference on, vol. 1, 1994, pp. 572–577 vol.1.

[66] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Commu-

nication, vol. 16, no. 3, pp. 261–291, 1995.

[67] M. Gray, J. Movellan, and T. Sejnowski, “Dynamic features for visual

speechreading: A systematic comparison,” Advances in Neural Information Pro-

cessing Systems, vol. 9, pp. 751–757, 1997.

[68] T. Hazen, K. Saenko, C. La, and J. Glass, “A segment-based audio-visual speech

recognizer: Data collection, development and initial experiments,” in Proc.

ICMI, State College, PA, 2004.

[69] M. Heckmann, F. Berthommier, and K. Kroschel, “Noise adaptive stream

weighting in audio-visual speech recognition,” EURASIP Journal on Applied Sig-

nal Processing, vol. 2002, no. 11, pp. 1260–1273, 2002.

[70] M. Heckmann, F. Berthommier, C. Savariaux, and K. Kroschel, “Effects of im-

age distortion on audio-visual speech recognition,” in ISCA Tutorial and Research

Workshop on Audio Visual Speech Processing. St Jorioz France: ISCA, 2003, pp.

163–168.

[71] M. Heckmann, K. Kroschel, C. Savariaux, and F. Berthommier, “DCT-based

video features for audio-visual speech recognition,” in International Conf. on Spo-

ken Language Processing, Denver, Colorado, 2002, pp. 92 093–0961.

[72] H. Hermansky, “Perceptual linear predictive (plp) analysis of speech,” The Jour-

nal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[73] H. Hermansky, “Should recognizers have ears?” Speech Communication, vol. 25,

no. 1-3, pp. 3–27, August 1998.

[74] H. F. Hollien, Forensic voice identification. San Diego, Calif.: Academic Press,

2002.


[75] T. Ikeda, H. Ishiguro, and M. Asada, “Adaptive fusion of sensor signals based

on mutual information maximization,” in Robotics and Automation, 2003. Proceed-

ings. ICRA ’03. IEEE International Conference on, vol. 3, 2003, pp. 4398–4402 vol.3.

[76] International Phonetic Association, Handbook of the International Phonetic Asso-

ciation : A Guide to the use of the International Phonetic Alphabet. Cambridge:

Cambridge University Press, 1999.

[77] F. Itakura, “Minimum prediction residual principle applied to speech recogni-

tion,” Acoustics, Speech, and Signal Processing, IEEE Transactions on, vol. 23, no. 1,

pp. 67–72, 1975.

[78] A. Jain, K. Nandakumar, and A. Ross, “Score normalization in multimodal bio-

metric systems,” Pattern recognition, vol. 38, no. 12, pp. 2270–2285, 2005.

[79] C. Jankowski, A. Kalyanswamy, S. Basson, and J. Spitz, “NTIMIT: a phonetically

balanced, continuous speech, telephone bandwidth speech database,” in Acous-

tics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference

on, 1990, pp. 109–112 vol.1.

[80] E. T. Jaynes and G. L. Bretthorst, Probability Theory : The Logic of Science. Cam-

bridge: Cambridge University Press, 2003.

[81] F. Jelinek, “The development of an experimental discrete dictation recognizer,”

Proceedings of the IEEE, vol. 73, no. 11, pp. 1616–1624, 1985.

[82] T. Jordan and P. Sergeant, “Effects of facial image size on visual and audio-visual

speech recognition,” Hearing by eye II: Advances in the psychology of speechreading

and auditory-visual speech, pp. 155–176, 1998.

[83] P. Jourlin, “Word-dependent acoustic-labial weights in HMM-based speech

recognition,” in Proceedings of AVSP’97, 1997, pp. 69–72. [Online]. Available:

citeseer.ist.psu.edu/279607.html

[84] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-labial speaker ver-

ification,” in Audio- and Video-based Biometric Person Authentication (AVBPA ’97),


First International Conference on, J. Bigün, G. Chollet, and G. Borgefors, Eds.,

vol. 1. Crans-Montana, Switzerland: Springer, 1997, pp. 319–26.

[85] M. Kamachi, H. Hill, K. Lander, and E. Vatikiotis-Bateson, “‘Putting the face to

the voice’: Matching identity across modality,” Current Biology, vol. 13, no. 19,

pp. 1709–1714, Sep. 2003.

[86] A. Kanak, E. Erzin, Y. Yemez, and A. Tekalp, “Joint audio-video processing for

biometric speaker identification,” inAcoustics, Speech, and Signal Processing, 2003.

Proceedings. (ICASSP ’03). 2003 IEEE International Conference on, vol. 2, 2003, pp.

II–377–80 vol.2.

[87] M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung,

“Lip geometric features for human-computer interaction using bimodal speech

recognition: comparison and analysis,” Speech Communication, vol. 43, no. 1-2,

pp. 1–16, 2004.

[88] L. Kersta, “Voiceprint identification,” The Journal of the Acoustical Society of Amer-

ica, vol. 34, p. 725, 1962.

[89] J. Kittler, “Combining classifiers: A theoretical framework,” Pattern Analysis &

Applications, vol. 1, no. 1, pp. 18–27, March, 1998 1998.

[90] T. Kleinschmidt, D. Dean, S. Sridharan, and M. Mason, “A continuous speech

recognition evaluation protocol for the AVICAR database,” in International Con-

ference on Signal Processing and Communication Systems (ICSPCS) (accepted), 2007.

[91] B. Knappmeyer, I. M. Thornton, and H. H. Bulthoff, “The use of facial motion

and facial form during the processing of identity,” Vision Research, vol. 43, no. 18,

pp. 1921–1936, Aug. 2003.

[92] S. Kong, J. Heo, B. Abidi, J. Paik, and M. Abidi, “Recent advances in visual and

infrared face recognition–a review,” Computer Vision and Image Understanding,

vol. 97, no. 1, pp. 103–135, 2005.

[93] P. Ladefoged, A Course In Phonetics, 3rd ed. Harcourt Brace College Publishers,

1993.


[94] K. Lander and L. Chuang, “Why are moving faces easier to recognize?” Visual

cognition, vol. 12, no. 3, pp. 429–442, 2005.

[95] F. Lavagetto, “Converting speech into lip movements: a multimedia telephone

for hard of hearing people,” Rehabilitation Engineering, IEEE Transactions on [see

also IEEE Trans. on Neural Systems and Rehabilitation], vol. 3, no. 1, pp. 90–102,

1995.

[96] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and

T. Huang, “AVICAR: An audiovisual speech corpus in a car environment,” in

Interspeech 2004, 2004.

[97] C.-H. Lee and J.-L. Gauvain, “Speaker adaptation based on MAP estimation of

HMM parameters,” in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93.,

1993 IEEE International Conference on, vol. 2, 1993, pp. 558–561 vol.2.

[98] C.-H. Lee and J.-L. Gauvain, “Bayesian adaptive learning and MAP estimation

of HMM,” in Automatic speech and speaker recognition : Advanced topics, C.-H. Lee,

F. K. Soong, and K. K. Paliwal, Eds. Kluwer Academic, 1996, ch. 4, pp. 83–107.

[99] N. Li, S. Dettmer, and M. Shah, “Lipreading using eigensequences,” in Proc. of

Int. Workshop on Automatic Face- and Gesture-Recognition, 1995, pp. 30–34.

[100] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. Nefian, “Speaker independent audio-

visual continuous speech recognition,” in Multimedia and Expo, 2002. ICME ’02.

Proceedings. 2002 IEEE International Conference on, vol. 2, 2002, pp. 25–28 vol.2.

[101] P. Lieberman, Uniquely Human: The Evolution of Speech, Thought, and Selfless Be-

havior. Harvard University Press, 1991.

[102] P. Lucey, “Lipreading across multiple views,” Ph.D. dissertation, Queensland

University of Technology, Brisbane, Australia, 2007.

[103] P. Lucey, D. Dean, and S. Sridharan, “Problems associated with area-based

visual speech feature extraction,” in Auditory-Visual Speech Processing (AVSP),

British Columbia, Canada, 2005, pp. 73–78.


[104] S. Lucey and T. Chen, “Improved audio-visual speaker recognition via the use

of a hybrid combination strategy,” in Audio- and Video-Based Biometric Person Au-

thentication. 4th International Conference, AVBPA 2003. Proceedings, J. Kittler and

M. Nixon, Eds. Guildford, UK: Springer-Verlag, 2003, pp. 929–936.

[105] S. Lucey, “Audio-visual speech processing,” Ph.D. dissertation, Queensland

University of Technology, Brisbane, 2002.

[106] S. Lucey, “An evaluation of visual speech features for the tasks of speech and

speaker recognition,” in Audio-and Video-Based Biometric Person Authentication

(AVBPA 2003), 4th International Conference on, ser. Lecture Notes in Computer

Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003, pp. 260–

267.

[107] J. Luettin and G. Maitre, “Evaluation protocol for the extended M2VTS database

(XM2VTSDB),” IDIAP, Tech. Rep., 1998.

[108] J. Luettin and N. A. Thacker, “Speechreading using probabilistic models,” Com-

puter Vision and Image Understanding, vol. 65, no. 2, pp. 163–178, 1997, IDIAP-RR

97-12.

[109] J. Luettin, N. Thacker, and S. Beet, “Learning to recognise talking faces,” in Pat-

tern Recognition, 1996., Proceedings of the 13th International Conference on, vol. 4,

1996, pp. 55–59 vol.4.

[110] J. Luettin, N. Thacker, and S. Beet, “Speechreading using shape and intensity in-

formation,” in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International

Conference on, vol. 1, 1996, pp. 58–61 vol.1.

[111] J. Luettin, “Visual speech and speaker recognition,” Ph.D. dissertation, Univer-

sity of Sheffield, May 1997.

[112] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The

DET curve in assessment of detection task performance,” in Eurospeech ’97, vol. 4, 1997,

pp. 1895–1898.

[113] J. Mason and J. Brand, “The role of dynamics in visual speech biometrics,” in

Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP ’02). IEEE In-

ternational Conference on, vol. 4, 2002, pp. IV–4076–IV–4079 vol.4.

[114] I. Matthews, J. Bangham, and S. Cox, “Audiovisual speech recognition using

multiscale nonlinear image decomposition,” in Spoken Language, 1996. ICSLP 96.

Proceedings., Fourth International Conference on, vol. 1, 1996, pp. 38–41 vol.1.

[115] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. Bangham, “Lipreading using

shape, shading and scale,” in Proc. of Audio Visual Speech Processing 1998 (AVSP

1998), 1998.

[116] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, “A comparison of model

and transform-based visual features for audio-visual LVCSR,” in Multimedia and

Expo, 2001. ICME 2001. IEEE International Conference on, 2001, pp. 825–828.

[117] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol.

264, no. 5588, pp. 746–748, Dec. 1976.

[118] U. Meier, R. Stiefelhagen, J. Yang, and A. Waibel, “Towards unrestricted lip read-

ing,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 14,

no. 5, pp. 571–585, 2000.

[119] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The ex-

tended M2VTS database,” in Audio and Video-based Biometric Person Authentica-

tion (AVBPA ’99), Second International Conference on, Washington D.C., 1999, pp.

72–77.

[120] P. Motlícek, L. Burget, and J. Cernocký, “Phoneme recognition of meetings using

audio-visual data,” in Joint AMI/PASCAL/IM2/M4 workshop, Martigny, CH, 2004.

[121] J. Movellan, “Visual speech recognition with stochastic networks,” in Advances

in neural information processing systems, G. Tesauro, D. Touretzky, and T. Leen,

Eds. Cambridge, MA: MIT Press, 1995, vol. 7, pp. 851–858.

[122] A. V. Nefian and L. H. Liang, “Bayesian networks in multimodal speech recogni-

tion and speaker identification,” in Conference Record of the Thirty-Seventh Asilomar

Conference on Signals, Systems and Computers, vol. 2, pp. 2004–2008, 2003.

[123] A. V. Nefian, L. H. Liang, T. Fu, and X. X. Liu, “A Bayesian approach to audio-

visual speaker identification,” in Audio-and Video-Based Biometric Person Authen-

tication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes in Com-

puter Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003, pp.

761–769.

[124] A. V. Nefian, L. Liang, X. Pi, and X. Liu, “Dynamic Bayesian networks for audio-

visual speech recognition,” EURASIP Journal on Applied Signal Processing, vol. 11,

pp. 1–15, 2002.

[125] A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, “A coupled

HMM for audio-visual speech recognition,” in Acoustics, Speech, and Signal Pro-

cessing, 2002. Proceedings. (ICASSP ’02). IEEE International Conference on, vol. 2,

2002, pp. 2013–2016.

[126] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison,

A. Mashari, and J. Zhou, “Audio-visual speech recognition: Workshop 2000 final

report,” Johns Hopkins University, CLSP, Tech. Rep. WS00AVSR, 2000.

[127] H. Olson and H. Belar, “Phonetic typewriter,” Audio, IRE Transactions on, vol. 5,

no. 4, pp. 90–95, 1957.

[128] A. O’Toole, D. Roark, and H. Abdi, “Recognizing moving faces: A psychological

and neural synthesis,” Trends in Cognitive Sciences, vol. 6, no. 6, pp. 261–266, 2002.

[129] H. Ouyang and T. Lee, “A new lip feature representation method for video-

based bimodal authentication,” in MMUI ’05: Proceedings of the 2005 NICTA-

HCSNet Multimodal User Interaction Workshop. Darlinghurst, Australia: Aus-

tralian Computer Society, Inc., 2006, pp. 33–37.

[130] H. Pan, S. Levinson, T. Huang, and Z.-P. Liang, “A fused hidden Markov model

with application to bimodal speech processing,” IEEE Transactions on Signal Pro-

cessing, vol. 52, no. 3, pp. 573–581, 2004.

[131] H. Pan, Z.-P. Liang, and T. S. Huang, “Estimation of the joint probability of mul-

tisensory signals,” Pattern Recognition Letters, vol. 22, no. 13, pp. 1431–1437, 2001.

[132] P. Papamichalis, Practical approaches to speech coding. Englewood Cliffs: Prentice-

Hall, 1987.

[133] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: a new audio-

visual database for multimodal human-computer interface research,” in Acous-

tics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP ’02). IEEE Interna-

tional Conference on, vol. 2, 2002, pp. 2017–2020.

[134] D. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,”

in Proceedings of the DARPA Speech and Natural Language Workshop, 1992, pp. 357–

362.

[135] E. Petajan, B. Bischoff, D. Bodoff, and N. M. Brooke, “An improved automatic

lipreading system to enhance speech recognition,” in CHI ’88: Proceedings of the

SIGCHI conference on Human factors in computing systems. New York, NY, USA:

ACM, 1988, pp. 19–25.

[136] E. D. Petajan, “Automatic lipreading to enhance speech recognition,” in Proc.

Global Telecomm. Conf., 1984, pp. 265–272.

[137] S. Pigeon. (1998, 5/4/2004) M2VTS multimodal face database. [Online]. Avail-

able: http://www.tele.ucl.ac.be/PROJECTS/M2VTS/m2fdb.html

[138] S. Pigeon. (1998, October) M2VTS project: Multi-

modal biometric person authentication. [Online]. Available:

http://www.tele.ucl.ac.be/PROJECTS/M2VTS/

[139] K. Pilz, I. Thornton, and H. Bülthoff, “A search advantage for faces learned in

motion,” Experimental Brain Research, vol. 171, no. 4, pp. 436–447, 2006.

[140] G. Potamianos, E. Cosatto, H. Graf, and D. Roe, “Speaker independent audio-

visual database for bimodal ASR,” in Proc. Europ. Tut. Work. Audio-Visual Speech

Proc, Rhodes, 1997, pp. 65–68.

[141] G. Potamianos and H. Graf, “Discriminative training of HMM stream exponents

for audio-visual speech recognition,” in Acoustics, Speech, and Signal Processing,

1998. ICASSP ’98. Proceedings of the 1998 IEEE International Conference on, vol. 6,

1998, pp. 3733–3736 vol.6.

[142] G. Potamianos, H. Graf, and E. Cosatto, “An image transform approach for

HMM based automatic lipreading,” in Image Processing, 1998. ICIP 98. Proceed-

ings. 1998 International Conference on, 1998, pp. 173–177 vol.3.

[143] G. Potamianos and C. Neti, “Automatic speechreading of impaired speech,” in

International Conference on Auditory-Visual Speech Processing (AVSP 2001). Aal-

borg, Denmark: ISCA, September 7-9 2001.

[144] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant fea-

tures for lipreading,” in Image Processing, 2001. Proceedings. 2001 International

Conference on, vol. 3, 2001, pp. 250–253 vol.3.

[145] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “Recent advances in

the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91,

no. 9, pp. 1306–1326, 2003.

[146] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, “Audio-visual automatic

speech recognition: An overview,” in Issues in Visual and Audio-Visual Speech

Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. MIT Press, 2004.

[147] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade image

transform for speaker independent automatic speechreading,” in Multimedia and

Expo, 2000. ICME 2000. 2000 IEEE International Conference on, vol. 2, 2000, pp.

1097–1100 vol.2.

[148] G. Potamianos and P. Lucey, “Audio-visual ASR from multiple views inside smart

rooms,” in Multisensor Fusion and Integration for Intelligent Systems, 2006 IEEE

International Conference on, 2006, pp. 35–40.

[149] G. Potamianos, C. Neti, G. Iyengar, A. W. Senior, and A. Verma, “A cascade vi-

sual front end for speaker independent automatic speechreading,” International

Journal of Speech Technology, vol. 4, no. 3, pp. 193–208, Jul 2001.

[150] R. Potter, “Visible patterns of sound,” Science, vol. 102, no. 2654, pp. 463–470,

1945.

[151] J. Psutka, L. Müller, and J. V. Psutka, “Comparison of MFCC and PLP parameteri-

zations in the speaker independent continuous speech recognition task,” in 7th

European Conference on Speech Communication and Technology, Aalborg, Denmark,

September 3-7 2001.

[152] L. Rabiner, S. Levinson, A. Rosenberg, and J. Wilpon, “Speaker-independent

recognition of isolated words using clustering techniques,” Acoustics, Speech, and

Signal Processing, IEEE Transactions on, vol. 27, no. 4, pp. 336–349, 1979.

[153] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition. Englewood

Cliffs, N.J.: PTR Prentice Hall, 1993.

[154] L. Rabiner, “The role of voice processing in telecommunications,” in Interactive

Voice Technology for Telecommunications Applications, 1994., Second IEEE Workshop

on, 1994, pp. 1–8.

[155] D. R. Reddy, “Approach to computer speech recognition by direct analysis of

the speech wave,” The Journal of the Acoustical Society of America, vol. 40, no. 5,

pp. 1273–1273, 1966.

[156] D. Reisberg, J. McLean, and A. Goldfield, “Easy to hear but hard to under-

stand: A lip-reading advantage with intact auditory stimuli,” in Hearing by Eye,

B. Dodd and R. Campbell, Eds. London: Lawrence Erlbaum Associates, 1987,

ch. 4, pp. 97–113.

[157] D. Reynolds, “Experimental evaluation of features for robust speaker identifica-

tion,” Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 4, pp. 639–643,

1994.

[158] D. Reynolds and R. Rose, “Robust text-independent speaker identification us-

ing Gaussian mixture speaker models,” Speech and Audio Processing, IEEE Trans-

actions on, vol. 3, no. 1, pp. 72–83, 1995.

[159] M. Roach, J. Brand, and J. Mason, “Acoustic and facial features for speaker

recognition,” in Pattern Recognition, 2000. Proceedings. 15th International Confer-

ence on, vol. 3, 2000, pp. 258–261 vol.3.

[160] D. Roark, A. J. O’Toole, H. Abdi, and S. Barrett, “Learning the moves: The effect

of familiarity and facial motion on person recognition across large changes in

viewing format,” Perception, vol. 35, pp. 761–773, 2006.

[161] A. A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics, ser.

International Series on Biometrics, D. D. Zhang and A. K. Jain, Eds. Springer,

2006.

[162] L. Rothkrantz, J. Wojdel, and P. Wiggers, “Comparison between different feature

extraction techniques in lipreading applications,” in Specom 2006, 2006.

[163] C. Sanderson, “The VidTIMIT database,” IDIAP, Communication 02-06, 2002.

[164] P. Scanlon and R. Reilly, “Feature analysis for automatic speechreading,” in Mul-

timedia Signal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 625–630.

[165] S. Schweinberger, “Hearing facial identities,” The Quarterly Journal of Experimen-

tal Psychology, 2007.

[166] S. Schweinberger, A. Herholz, and V. Stief, “Auditory long term memory: Rep-

etition priming of voice recognition,” The Quarterly Journal of Experimental Psy-

chology Section A, vol. 50, no. 3, pp. 498–517, 1997.

[167] J. Shepherd, G. Davies, and H. Ellis, “Studies of cue saliency,” in Perceiving and

Remembering Faces, 1981, pp. 105–131.

[168] P. Silsbee and A. Bovik, “Computer lipreading for improved accuracy in au-

tomatic speech recognition,” Speech and Audio Processing, IEEE Transactions on,

vol. 4, no. 5, pp. 337–351, 1996.

[169] D. G. Stork, G. Wolff, and E. Levine, “Neural network lipreading system for improved

speech recognition,” in Neural Networks, 1992. IJCNN., International Joint Confer-

ence on, vol. 2, June 1992, pp. 289–295.

[170] W. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,”

The Journal of the Acoustical Society of America, vol. 26, p. 212, 1954.

[171] Q. Summerfield, “Lipreading and audio-visual speech perception,” Philosophical

Transactions: Biological Sciences, vol. 335, no. 1273, pp. 71–78, 1992.

[172] Q. Summerfield, “Some preliminaries to a comprehensive account of audio-

visual speech perception,” in Hearing by Eye, B. Dodd and R. Campbell, Eds.

London: Lawrence Erlbaum Associates, 1987, ch. 1, pp. 1–51.

[173] P. Teissier, J.-L. Schwartz, and A. Guerin-Dugue, “Models for audiovisual fusion

in a noisy-vowel recognition task,” in Multimedia Signal Processing, 1997., IEEE

First Workshop on, 1997, pp. 37–44.

[174] R. Tenney and N. Sandell, “Detection with distributed sensors,” Aerospace and

Electronic Systems, IEEE Transactions on, vol. AES-17, no. 4, pp. 501–510, 1981.

[175] M. Turk and A. Pentland, “Face recognition using eigenfaces,” in Computer Vi-

sion and Pattern Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society

Conference on, 1991, pp. 586–591.

[176] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition

II: NOISEX-92: A database and an experiment to study the effect of additive noise

on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–

251, 1993.

[177] E. Vatikiotis-Bateson, G. Bailly, and P. Perrier, Eds., Audio-Visual Speech Process-

ing. MIT Press, 2006.

[178] T. K. Vintsyuk, “Speech discrimination by dynamic programming,” Cybernetics

and Systems Analysis, vol. 4, no. 1, pp. 52–57, Jan 1968.

[179] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple

features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceed-

ings of the 2001 IEEE Computer Society Conference on, vol. 1, 2001, pp. I–511–I–518

vol.1.

[180] K. von Kriegstein, A. Kleinschmidt, P. Sterzer, and A.-L. Giraud, “Interaction

of face and voice areas during speaker recognition,” Journal of Cognitive Neuro-

science, vol. 17, no. 3, pp. 367–376, 2005.

[181] T. Wagner and U. Dieckmann, “Multi-sensorial inputs for the identification of

persons with synergetic computers,” in Image Processing, 1994. Proceedings. ICIP-

94., IEEE International Conference, vol. 2, 1994, pp. 287–291 vol.2.

[182] T. Wagner and U. Dieckmann, “Sensor-fusion for robust identification of per-

sons: a field test,” in Image Processing, 1995. Proceedings., International Conference

on, vol. 3, 1995, pp. 516–519 vol.3.

[183] G. K. Wallace, “The JPEG still picture compression standard,” Commun. ACM,

vol. 34, no. 4, pp. 30–44, 1991.

[184] S. Wang, W. Lau, S. Leung, and H. Yan, “A real-time automatic lipreading sys-

tem,” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of the 2004 International

Symposium on, vol. 2, 2004, pp. II–101–4 Vol.2.

[185] T. Wark and S. Sridharan, “A syntactic approach to automatic lip feature extrac-

tion for speaker identification,” in Acoustics, Speech, and Signal Processing, 1998.

ICASSP ’98. Proceedings of the 1998 IEEE International Conference on, vol. 6, 1998,

pp. 3693–3696 vol.6.

[186] T. Wark and S. Sridharan, “Adaptive fusion of speech and lip information for

robust speaker identification,” Digital Signal Processing, vol. 11, no. 3, pp. 169–

186, 2001.

[187] T. Wark, S. Sridharan, and V. Chandran, “Robust speaker verification via asyn-

chronous fusion of speech and lip information,” in Audio- and Video-based Bio-

metric Person Authentication (AVBPA ’99), 2nd International Conference on, Wash-

ington, D.C., 1999, pp. 37–42.

[188] T. Wark, S. Sridharan, and V. Chandran, “The use of temporal speech and lip

information for multi-modal speaker identification via multi-stream HMMs,” in

Acoustics, Speech, and Signal Processing, 2000. ICASSP ’00. Proceedings. 2000 IEEE

International Conference on, vol. 6, 2000, pp. 2389–2392 vol.4.

[189] T. Wark, D. Thambiratnam, and S. Sridharan, “Person authentication using lip

information,” in TENCON ’97. IEEE Region 10 Annual Conference. Speech and Im-

age Technologies for Computing and Telecommunications’., Proceedings of IEEE, vol. 1,

1997, pp. 153–156 vol.1.

[190] J. Webb and E. Rissanen, “Speaker identification experiments using HMMs,” in

Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International

Conference on, vol. 2, 1993, pp. 387–390 vol.2.

[191] P. L. Williams, Gray’s anatomy of the human body, 20th ed. Churchill Livingstone,

1918.

[192] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: a survey,”

Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 1, pp.

34–58, 2002.

[193] Y. Yemez, A. Kanak, E. Erzin, and A. Tekalp, “Multimodal speaker identifica-

tion with audio-video processing,” in Image Processing, 2003. Proceedings. 2003

International Conference on, vol. 3, 2003, pp. 5–8.

[194] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey,

V. Valtchev, and P. Woodland, The HTK Book, 3rd ed. Cambridge, UK: Cam-

bridge University Engineering Department., 2002.

[195] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, “Integration of acoustic and visual speech sig-

nals using neural networks,” Communications Magazine, IEEE, vol. 27, no. 11,

November 1989, pp. 65–71.

[196] A. L. Yuille, P. W. Hallinan, and D. S. Cohen, “Feature extraction from faces using

deformable templates,” International Journal of Computer Vision, vol. 8, no. 2, pp.

99–111, August 1992.

[197] Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. Huang, and S. Levin-

son, “Audio-visual affect recognition through multi-stream fused HMM for HCI,”

in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer So-

ciety Conference on, vol. 2, 2005, pp. 967–972.

[198] X. Zhang and R. Mersereau, “Lip feature extraction towards an automatic

speechreading system,” in Image Processing, 2000. Proceedings. 2000 International

Conference on, vol. 3, 2000, pp. 226–229 vol.3.

[199] W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld, “Face recognition: A lit-

erature survey,” ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458,

2003.