
Page 1

Regularized Adaptation: Theory, Algorithms and Applications

Xiao Li

Electrical Engineering Department

University of Washington

Page 2

Goal

When a target distribution p^ad(x, y) is close to but different from the training distribution p^tr(x, y), a classifier optimized for p^tr(x, y) needs to be adapted for p^ad(x, y), using a small amount of labeled adaptation data.

Page 3

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 4

Inductive Learning

Given
- a set of m samples (x_i, y_i) ~ p(x, y)
- a decision function space F: X → {±1}

Goal
- learn a decision function f ∈ F that minimizes the expected error

  R(f) = E_{(x,y) ~ p(x,y)}[ I(f(x) ≠ y) ]

In practice
- minimize the empirical error

  R_emp(f) = (1/m) Σ_{i=1}^m I(f(x_i) ≠ y_i)

- while applying a regularization strategy to achieve good generalization performance
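As a concrete illustration (my own sketch, not from the slides), the empirical error of a ±1 decision function under 0-1 loss can be computed as follows; the linear rule `f` and the toy data are hypothetical:

```python
import numpy as np

def empirical_error(f, X, y):
    """R_emp(f) = (1/m) * sum_i I(f(x_i) != y_i): the 0-1 loss over m samples."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical affine decision rule f(x) = sgn(w.x + b) on toy data.
w, b = np.array([1.0, -0.5]), 0.1
f = lambda x: np.sign(w @ x + b)
X = np.array([[0.2, 0.1], [-1.0, 0.3], [0.5, -0.2]])
y = np.array([1, -1, 1])
print(empirical_error(f, X, y))  # fraction of misclassified samples
```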

Page 5

Why Is Regularization Helpful?

Learning theory says
- Vapnik's VC bound (a frequentist view): Φ depends on the VC dimension of F
- Occam's Razor bound (a Bayesian view): Φ depends on π(f), for a countable function space

Pr{ ∀f ∈ F: R(f) ≤ R_emp(f) + Φ(F, f, m, δ) } ≥ 1 − δ

"Accuracy-regularization"
- We want to minimize the empirical error as well as the capacity
- Support vector machines (a frequentist view)
- Bayesian model selection (a Bayesian view)

Page 6

Adaptive Learning (Adaptation)

Two related yet different distributions
- Training: p^tr(x, y)
- Target (test-time): p^ad(x, y)

Given
- An unadapted model: f^tr = argmin_{f ∈ F} R_{p^tr(x,y)}(f)
- Adaptation data (labeled): D_m = {(x_i, y_i) ~ p^ad(x, y)}_{i=1}^m

Goal
- Learn an adapted model that is as close as possible to our desired model: f^ad = argmin_{f ∈ F} R_{p^ad(x,y)}(f)

Notes
- Adaptation data D_m may be very limited
- Original data for training f^tr is not preserved for adaptation

Page 7

Scenarios

Customization
- Speech recognition: speaker adaptation
- Handwriting recognition: writer adaptation
- Language processing: domain adaptation

Evolutionary environment
- Spam filter

Page 8

Practical Work on Adaptation

Gaussian mixture models (GMMs)
- MAP (Gauvain 94); MLLR (Leggetter 95)

Support vector machines (SVMs)
- Boosting-like approach (Matic 93)
- Weighted combination of old support vectors and adaptation data (Wu 04)

Multi-layer perceptrons (MLPs)
- Shared "internal representation" in transfer learning (Baxter 95, Caruana 97, Stadermann 05)
- Linear input network (Neto 95)

Conditional maximum entropy models
- Gaussian prior (Chelba 04)

Page 9

This Work Seeks Answers to …

A unified and principled approach to adaptation
- Applicable to a variety of classifiers
- Amenable to variations in the amount of adaptation data

Quantitative relationships between
- The generalization error bound (or sample complexity bound), and
- The divergence between training and target distributions

Page 10

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 11

Bayesian Fidelity Prior

Adaptation objective

min_{f ∈ F}  R_emp(f) − λ ln p_fid(f)

- R_emp(f) – empirical error on the adaptation data
- p_fid(f) – Bayesian "fidelity prior"

Fidelity prior

ln p_fid(f) ≜ E_{p^tr(x,y)}[ ln p(f | x, y) ]

- How likely a classifier is given a training distribution rather than a training set – key difference from Baxter 97
- Reflects the "fidelity" to the unadapted model
- Related to the KL-divergence

Page 12

Generative Models

Generative models
- F – a space of generative models p(x, y | f)

Classification

ŷ = argmax_y log p(x, y | f)

Posterior

p(f | x, y) = p(x, y | f) π(f) / Σ_{f′} p(x, y | f′) π(f′)

where π(f) is the standard prior, chosen before training.

Assume f^tr and f^ad are the true models generating the training and target distributions respectively, i.e.

p(x, y | f^tr) = p^tr(x, y)
p(x, y | f^ad) = p^ad(x, y)

Note that this assumption is justified when the function space contains the true models and if we use the log-likelihood loss.

Page 13

Fidelity Prior for Generative Models

Key result

ln p_fid(f) = −β D( p(x, y | f^tr) ∥ p(x, y | f) ) + ln π(f),  where β > 0

Implication
- Fidelity prior at the desired model:

  ln p_fid(f^ad) = −β D( p^tr ∥ p^ad ) + ln π(f^ad),

  so D( p^tr ∥ p^ad ) → 0  ⟹  p_fid(f^ad) → π(f^ad)

- We are more likely to learn our desired model using the fidelity prior than using the standard prior

Page 14

Instantiations

To compute the fidelity prior
- Assuming a "uniform" standard prior, this prior is determined by the KL-divergence
- We can use an upper bound on the KL-divergence (hence a lower bound on the prior)

Gaussian models
- The fidelity prior is a normal-Wishart distribution

Mixture models
- An upper bound on the KL-divergence (using the log-sum inequality); see the sketch below

Hidden Markov models
- An upper bound on the KL-divergence (Silva 06)
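For the mixture case, the slide only names the tool; a standard matched-pair bound obtained from the log-sum inequality (my reconstruction of the kind of bound meant, not the thesis's exact statement) is:

```latex
% Upper bound on the KL-divergence between two mixtures with weights a, b
% and components p_i, q_i, pairing components index-by-index:
\[
D\!\left(\textstyle\sum_i a_i p_i \,\Big\|\, \sum_i b_i q_i\right)
\;\le\; \underbrace{\textstyle\sum_i a_i \log \tfrac{a_i}{b_i}}_{D(a \,\|\, b)}
\;+\; \textstyle\sum_i a_i\, D(p_i \,\|\, q_i).
\]
% Each component-wise term D(p_i || q_i) has a closed form for Gaussians,
% so the bound (and hence a lower bound on the fidelity prior) is computable.
```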

Page 15

Discriminative Models

A unified view of SVMs, MLPs, CRFs, etc.
- Affine classifiers in a transformed space: f = (w, b)
- Classification: sgn( w^T Φ(x) + b )

        Φ(x)                       Loss function
SVMs    Determined by the kernel   Hinge loss
MLPs    Hidden neurons             Log loss
CRFs    Any feature function       Log loss

Conditional distribution (for the binary case)

p(y | x, f) = 1 / ( 1 + e^{−y (w^T Φ(x) + b)} )

Page 16

Discriminative Models (cont.)

Conditional models
- F – a space of conditional models p(y | x, f)

Classification

ŷ = argmax_y log p(y | x, f)

Posterior

p(f | x, y) = p(y | x, f) π(f) / Σ_{f′} p(y | x, f′) π(f′)

Assume f^tr and f^ad are the true models describing the training and target conditional distributions respectively, i.e.

p(y | x, f^tr) = p^tr(y | x)
p(y | x, f^ad) = p^ad(y | x)

Page 17

Fidelity Prior for Conditional Models

Again a divergence

ln p_fid(f) = −β D( p(y | x, f^tr) ∥ p(y | x, f) ) + ln π(f),  where β > 0

What if we do not know p^tr(x, y)?
- We seek an upper bound on the KL-divergence, and hence a lower bound on the prior

Key result

D( p(y | x, f^tr) ∥ p(y | x, f) ) ≤ R̄ ∥w − w^tr∥ + |b − b^tr|

where R̄ = E[ ∥Φ(x)∥ ].

Page 18

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Learning and adaptation in the Vocal Joystick
- Conclusions and future work

Page 19

Occam's Razor Bound for Adaptation

(McAllester 98) For a countable function space, for any prior π(f):

Pr{ ∀f ∈ F: R(f) ≤ R_emp(f) + sqrt( (−ln π(f) + ln(1/δ)) / (2m) ) } ≥ 1 − δ

Apply for adaptation: substituting the fidelity prior for π(f), the complexity term becomes

sqrt( ( β D( p(x, y | f^tr) ∥ p(x, y | f) ) − ln π(f) + ln(1/δ) ) / (2m) )

Page 20

[Figure: the complexity term Φ(f, m, δ) plotted against sample size m, comparing the bound using the standard prior with bounds using the fidelity prior at increasing values of the KL-divergence]

Φ(f, m, δ) = sqrt( ( β D( p(x, y | f^tr) ∥ p(x, y | f) ) − ln π(f) + ln(1/δ) ) / (2m) )
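To make the plotted comparison concrete, a small sketch (my own illustration; β, −ln π(f), m, and the KL values are made up) that evaluates the complexity term for growing divergence:

```python
import math

def bound_term(beta, kl, neg_log_prior, m, delta):
    """Phi(f, m, delta) = sqrt((beta*KL - ln pi(f) + ln(1/delta)) / (2m))."""
    return math.sqrt((beta * kl + neg_log_prior + math.log(1.0 / delta)) / (2 * m))

m, delta = 1000, 0.05         # adaptation sample size, confidence level
neg_log_pi, beta = 8.0, 1.0   # hypothetical -ln pi(f) and beta

# The bound loosens as the model moves away from the unadapted model,
# matching the "increasing KL" trend in the plot above.
for kl in [0.0, 0.5, 2.0, 8.0]:
    print(f"KL = {kl:4.1f}  ->  Phi = {bound_term(beta, kl, neg_log_pi, m, delta):.4f}")
```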

Page 21

PAC-Bayesian Bounds for Adaptation

(McAllester 98) For both countable and uncountable function spaces, for any prior p(f) and posterior q(f):

Pr{ ∀q: E_{f~q}[R(f)] ≤ E_{f~q}[R_emp(f)] + sqrt( ( D(q(f) ∥ p(f)) + ln m + ln(1/δ) + 2 ) / (2m − 1) ) } ≥ 1 − δ

Choice of prior p(f) and posterior q(f)
- Use p_fid(f) or its related forms as the prior
- Choose q(f) to have the same parametric form

Examples
- Gaussian models: D( q(f) ∥ p(f) ) = (μ′ − μ^tr)^T Σ^{−1} (μ′ − μ^tr) / 2
- Linear classifiers: D( q(f) ∥ p(f) ) = ∥w′ − w^tr∥² / 2
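The linear-classifier example follows from the KL-divergence between two unit-variance Gaussians centered at the new and unadapted weights; a quick check of that identity (a standard result, reconstructed here rather than taken from the slide):

```latex
% KL-divergence between posterior q = N(w', I) and prior p = N(w^tr, I):
\[
D\big(\mathcal{N}(w', I)\,\|\,\mathcal{N}(w^{tr}, I)\big)
= \tfrac{1}{2}\big(\operatorname{tr}(I) + \|w' - w^{tr}\|^{2} - d - \ln\det I\big)
= \tfrac{1}{2}\,\|w' - w^{tr}\|^{2},
\]
% where d is the weight dimension: the trace and log-determinant terms
% cancel because both covariances are the identity matrix.
```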

Page 22

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 23

Algorithms Derived from the Fidelity Prior

Generative models

min_{f ∈ F}  R_emp(f) + λ D( p(x, y | f^tr) ∥ p(x, y | f) )

- Relates to MAP adaptation

Conditional models

min_{f ∈ F}  R_emp(f) + λ D( p(y | x, f^tr) ∥ p(y | x, f) )

In particular, for log-linear models:

min_{f ∈ F}  R_emp(f) + λ₁ ∥w − w^tr∥² + λ₂ |b − b^tr|²

We focus on SVMs and MLPs; a sketch of the log-linear case follows below.
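A minimal sketch of the log-linear instantiation, assuming binary labels in {−1, +1} and logistic loss; the regularizer pulls (w, b) toward the unadapted (w_tr, b_tr) rather than toward zero. Function names, data, and λ values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def adapt_log_linear(X, y, w_tr, b_tr, lam1=1.0, lam2=1.0):
    """min_f R_emp(f) + lam1*||w - w_tr||^2 + lam2*|b - b_tr|^2 (logistic loss)."""
    def objective(theta):
        w, b = theta[:-1], theta[-1]
        margins = y * (X @ w + b)
        log_loss = np.mean(np.logaddexp(0.0, -margins))  # ln(1 + e^{-margin})
        return log_loss + lam1 * np.sum((w - w_tr) ** 2) + lam2 * (b - b_tr) ** 2
    theta0 = np.append(w_tr, b_tr)  # warm-start at the unadapted model
    res = minimize(objective, theta0, method="L-BFGS-B")
    return res.x[:-1], res.x[-1]

# Hypothetical unadapted model and a handful of adaptation samples.
w_tr, b_tr = np.array([1.0, -1.0]), 0.0
X = np.array([[0.5, 0.1], [-0.3, 0.8], [1.2, -0.4], [-0.9, -0.2]])
y = np.array([1, -1, 1, -1])
w_ad, b_ad = adapt_log_linear(X, y, w_tr, b_tr, lam1=0.5, lam2=0.5)
```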

Page 24

Regularized SVM Adaptation

Optimization objective

min_{w,b}  (1/2) ∥w − w^tr∥² + (C/m) Σ_{i=1}^m ξ_i
s.t.  ξ_i ≥ 1 − y_i ( w^T Φ(x_i) + b ),  ξ_i ≥ 0

Globally optimal solution

f(x) = Σ_{i=1}^{|SV^tr|} α_i^tr y_i^tr k(x_i^tr, x) + Σ_{i=1}^{m} α_i y_i k(x_i, x)

- Regularized – fixing old support vectors and their coefficients
- Extended regularized – update coefficients of old support vectors as well, with two separate constraints in the dual space
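A sketch of evaluating the adapted decision function from the expansion above: the old support-vector term is frozen (the regularized variant), while the new coefficients would come from solving the dual on the adaptation data. All names are illustrative:

```python
import numpy as np

def rbf_kernel(a, b, std=10.0):
    """k(a, b) = exp(-||a - b||^2 / (2 std^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * std ** 2))

def adapted_decision(x, sv_tr, y_tr, alpha_tr, sv_ad, y_ad, alpha_ad, b=0.0):
    """f(x) = sum_i alpha_tr_i y_tr_i k(x_tr_i, x) + sum_i alpha_i y_i k(x_i, x).

    The first sum over old support vectors is held fixed (regularized);
    the extended regularized variant would re-estimate alpha_tr as well.
    """
    old_term = sum(a * y * rbf_kernel(sv, x)
                   for a, y, sv in zip(alpha_tr, y_tr, sv_tr))
    new_term = sum(a * y * rbf_kernel(sv, x)
                   for a, y, sv in zip(alpha_ad, y_ad, sv_ad))
    return np.sign(old_term + new_term + b)  # b: bias term, if one is used
```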

Page 25

Algorithms in Comparison

- Unadapted
- Retrained: use adaptation data only
- Boosted (Matic 93): select adaptation data misclassified by the unadapted model; combine with old support vectors
- Bootstrapped (proposed in thesis): train a seed classifier using adaptation data only; select old support vectors correctly classified by the seed classifier; combine with adaptation data (see the sketch after this list)
- Regularized and Extended regularized
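A sketch of the bootstrapped selection step, assuming scikit-learn's SVC as the trainer; the thesis does not prescribe a toolkit, and all names are illustrative (gamma = 1/(2·std²) ≈ 0.005 for the std = 10 RBF kernel used earlier):

```python
import numpy as np
from sklearn.svm import SVC

def bootstrap_adapt(X_ad, y_ad, sv_tr, sv_tr_labels, gamma=0.005):
    """Bootstrapped SVM adaptation:
    1) train a seed classifier on adaptation data only;
    2) keep old support vectors that the seed classifies correctly;
    3) retrain on the adaptation data plus the kept support vectors."""
    seed = SVC(kernel="rbf", gamma=gamma).fit(X_ad, y_ad)
    keep = seed.predict(sv_tr) == sv_tr_labels
    X_comb = np.vstack([X_ad, sv_tr[keep]])
    y_comb = np.concatenate([y_ad, sv_tr_labels[keep]])
    return SVC(kernel="rbf", gamma=gamma).fit(X_comb, y_comb)
```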

Page 26

Regularized MLP Adaptation

Optimization objective for a multi-class, two-layer MLP

min_{W_h2o, W_i2h}  (λ₁/2) ∥W_h2o − W_h2o^tr∥² + (λ₂/2) ∥W_i2h − W_i2h^tr∥² + R_emp(f)

- W_h2o and W_i2h – the hidden-to-output and input-to-hidden weight matrices, respectively
- ∥W∥² = tr(W^T W)
- R_emp(f) – cross-entropy, corresponding to logistic loss

Locally optimal solution found using back-propagation (see the sketch below).
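A minimal numpy sketch of back-propagation under this objective, assuming tanh hidden units, softmax outputs, one-hot targets, and no bias terms; the hyperparameters and these simplifications are mine, not the thesis's:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adapt_mlp(X, Y, W_i2h_tr, W_h2o_tr, lam1=0.1, lam2=0.1, lr=0.05, epochs=200):
    """Gradient descent on (lam1/2)||W_h2o - W_h2o_tr||^2
    + (lam2/2)||W_i2h - W_i2h_tr||^2 + cross-entropy.
    X: (m, d) inputs; Y: (m, K) one-hot targets."""
    W_i2h, W_h2o = W_i2h_tr.copy(), W_h2o_tr.copy()  # warm-start at unadapted
    m = X.shape[0]
    for _ in range(epochs):
        H = np.tanh(X @ W_i2h)                 # hidden activations (m, h)
        P = softmax(H @ W_h2o)                 # class posteriors (m, K)
        dZ = (P - Y) / m                       # grad of cross-entropy wrt logits
        g_h2o = H.T @ dZ + lam1 * (W_h2o - W_h2o_tr)
        dH = (dZ @ W_h2o.T) * (1.0 - H ** 2)   # back-prop through tanh
        g_i2h = X.T @ dH + lam2 * (W_i2h - W_i2h_tr)
        W_h2o -= lr * g_h2o
        W_i2h -= lr * g_i2h
    return W_i2h, W_h2o
```

Setting λ₁ = λ₂ = 0 recovers plain retraining from the unadapted weights, one sense in which the comparison algorithms on the next slide are special cases of the regularized form.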

Page 27

Algorithms in Comparison

- Unadapted
- Retrained: start from randomly initialized weights and train with weight decay
- Linear input network (Neto 95): add a linear transformation in the input space
- Retrained speaker-independent (Neto 95): start from unadapted; train both layers
- Retrained last layer (Baxter 95, Caruana 97, Stadermann 05): start from unadapted; only train the last layer
- Retrained first layer (proposed in thesis): start from unadapted; only train the first layer
- Regularized

Many of the algorithms above can be considered special cases of regularized adaptation.

Page 28

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 29

Experimental Paradigm

Procedure
- Train an unadapted model on the training set
- Adapt (with supervision) and evaluate via n-fold CV on the test set
- Select regularization coefficients on the dev set

Corpora
- VJ vowel dataset (Kilanski 06)
- NORB image dataset (LeCun 04)

Page 30

VJ Vowel Dataset

Task
- 8 vowel classes
- Frame-level classification error rate
- Speaker adaptation

Data allocation
- Training set – 21 speakers, 420K samples (for SVM, we randomly selected 80K samples for training)
- Test set – 10 speakers, 200 samples
- Dev set – 4 speakers, 80 samples

Features
- 182 dimensions – 7 frames of MFCC + delta features

Page 31

SVM Adaptation

- RBF kernel (std = 10) optimized for training and fixed for adaptation
- Mean and std. dev. of frame-level error rates (%) over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       38.21 ± 4.68    38.21 ± 4.68    38.21 ± 4.68
Retrained                       24.70 ± 2.47    18.94 ± 1.52    14.00 ± 1.40
Boosted                         29.66 ± 4.60    26.54 ± 2.49    28.85 ± 2.03
Bootstrapped                    26.16 ± 5.07    19.24 ± 1.32    14.41 ± 1.26
Regularized                     23.28 ± 4.21    19.01 ± 1.30    15.00 ± 1.41
Ext. regularized                28.55 ± 4.99    25.38 ± 2.49    20.36 ± 2.08

(Red entries in the original slide mark the best result and those not significantly different from the best at the p < 0.001 level.)

Page 32

MLP Adaptation (I)

- 50 hidden nodes
- Mean and std. dev. over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       32.03 ± 3.76    32.03 ± 3.76    32.03 ± 3.76
Retrained (reg.)                14.21 ± 2.50    10.06 ± 3.15    9.09 ± 3.92
Linear input                    13.52 ± 2.22    11.81 ± 2.33    11.96 ± 1.77
Retrained SI                    12.15 ± 2.70    9.64 ± 2.74     7.88 ± 2.49
Retrained last                  15.45 ± 2.75    13.32 ± 2.46    11.40 ± 2.37
Retrained first                 11.56 ± 2.09    9.12 ± 3.08     7.35 ± 2.26
Regularized                     11.56 ± 2.09    8.16 ± 2.60     7.30 ± 2.40

Page 33

MLP Adaptation (II)

Varying the number of vowel classes available in the adaptation data.

[Figure: error rate (%) on all vowel classes, 0–50, vs. number of vowel classes used for adaptation, 1–8, comparing linear input net, retrained SI, retrained last, retrained first, and regularized]

Page 34

NORB Image Dataset

Task
- 5 object classes
- Classify each image
- Lighting condition adaptation

Data allocation
- Training set – 2700 samples
- Test set – 2700 samples

Features
- 32x32 raw images

Page 35

SVM Adaptation

- RBF kernel (std = 500) optimized for training and fixed for adaptation
- Mean and std. dev. of classification error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition    90            180
Unadapted                                  12.5 ± 2.7    12.5 ± 2.7
Retrained                                  30.1 ± 3.5    18.9 ± 2.5
Boosted                                    11.5 ± 2.0    10.7 ± 1.5
Bootstrapped                               21.9 ± 4.8    14.9 ± 2.5
Regularized                                14.8 ± 1.9    13.4 ± 1.9
Ext. regularized                           11.0 ± 1.7    10.4 ± 1.9

(Red entries in the original slide mark the best result and those not significantly different from the best at the p < 0.001 level.)

Page 36

MLP Adaptation

- 30 hidden nodes
- Mean and std. dev. over 6 lighting conditions

# adapt. samples per lighting condition    90             180
Unadapted                                  21.4 ± 6.6     21.4 ± 6.6
Retrained (reg.)                           41.9 ± 15.8    32.5 ± 12.8
Retrained SI                               19.6 ± 4.3     18.6 ± 4.3
Regularized                                19.2 ± 3.4     18.0 ± 3.7

Page 37

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 38

Why the Vocal Joystick

Hands-free computer interfaces
- Eye-gaze tracking – high cost
- Head-mouse or chin-joystick – low bandwidth
- Speech interfaces – inability to do continuous control

The Vocal Joystick (Bilmes 05)
- Can control both continuous and discrete tasks
- Utilizes intensity, vowel quality, pitch, and discrete sound identity
- The VJ-mouse control scheme

Joint work with
- Graduate students: J. Malkin, S. Harada and K. Kilanski
- Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck

Page 39

Page 40

The Pattern Recognition Module

[Diagram: input features (intensity, zero-crossing, autocovariance, MFCCs) pass through an adaptive filter; pitch tracking (a graphical model) outputs pitch; vowel detection and vowel classification (a two-layer MLP, with regularized MLP adaptation) output vowel posteriors; discrete sound detection and recognition (phoneme HMMs, with regularized GMM adaptation) output the discrete sound ID]

Page 41

Vowel Classifier Adaptation

Classifier
- Two-layer MLP with 4 or 8 outputs
- Trained on the VJ-vowel dataset

Adaptation
- Motivation: overlap of different vowel classes articulated by different speakers
- Regularized adaptation for MLPs
- Improvements over the unadapted classifier (with 2 secs of adaptation data per vowel): 4-class: 7.6% → 0.2%; 8-class: 32.0% → 8.2%

Page 42

Discrete Sound Recognizer Adaptation

Recognizer
- Phoneme HMMs
- Trained on TIMIT

Rejection
- "In-vocabulary": unvoiced consonants
- Garbage utterances: vowels, breathing, extraneous speech
- Heuristics for rejection: duration, zero-crossing, pitch, posteriors

Adaptation
- Motivation: consonant articulation mismatch between TIMIT data and the VJ application
- Regularized adaptation for GMMs
- Obtains better rejection thresholds

Page 43

Conclusions and Future Work

Key contributions
- A Bayesian fidelity prior and generalization error bounds
- Regularized adaptation algorithms derived from the fidelity prior
- Application to the Vocal Joystick

Future work
- Theoretical: adaptation error bounds for classifiers in an uncountable function space
- Algorithmic: kernel adaptation
- The Vocal Joystick: discrete sound data collection and training a better baseline

Page 44

Thank You