
Page 1

Regularized Adaptation: Theory, Algorithms and Applications

Xiao Li

Electrical Engineering Department

University of Washington

Page 2

Goal

When a target distribution p^ad(x, y) is close to but different from the training distribution p^tr(x, y), a classifier optimized for p^tr(x, y) needs to be adapted for p^ad(x, y), using a small amount of labeled adaptation data.

Page 3

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 4

Inductive Learning

Given
- a set of m samples (x_i, y_i) ~ p(x, y)
- a decision function space F: X → {±1}

Goal
- learn a decision function f ∈ F that minimizes the expected error

  R(f) = E_{(x,y) ~ p(x,y)}[ I(f(x) ≠ y) ]

In practice
- minimize the empirical error

  R_emp(f) = (1/m) Σ_{i=1}^m I(f(x_i) ≠ y_i)

- while applying a regularization strategy to achieve good generalization performance
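As a concrete illustration (my own sketch, not from the slides), the empirical error of a ±1 decision function under 0-1 loss can be computed as follows; the linear rule `f` and the toy data are hypothetical:

```python
import numpy as np

def empirical_error(f, X, y):
    """R_emp(f) = (1/m) * sum_i I(f(x_i) != y_i): the 0-1 loss over m samples."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical affine decision rule f(x) = sgn(w.x + b) on toy data.
w, b = np.array([1.0, -0.5]), 0.1
f = lambda x: np.sign(w @ x + b)
X = np.array([[0.2, 0.1], [-1.0, 0.3], [0.5, -0.2]])
y = np.array([1, -1, 1])
print(empirical_error(f, X, y))  # fraction of misclassified samples
```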

Page 5

Why Is Regularization Helpful?

Learning theory says
- Vapnik's VC bound (a frequentist view): Φ depends on the VC dimension of F
- Occam's Razor bound (a Bayesian view): Φ depends on π(f), for a countable function space

Pr{ ∀f ∈ F: R(f) ≤ R_emp(f) + Φ(F, f, m, δ) } ≥ 1 − δ

"Accuracy-regularization"
- We want to minimize the empirical error as well as the capacity
- Support vector machines (a frequentist view)
- Bayesian model selection (a Bayesian view)

Page 6

Adaptive Learning (Adaptation)

Two related yet different distributions
- Training: p^tr(x, y)
- Target (test-time): p^ad(x, y)

Given
- An unadapted model: f^tr = argmin_{f ∈ F} R_{p^tr(x,y)}(f)
- Adaptation data (labeled): D_m = {(x_i, y_i) ~ p^ad(x, y)}_{i=1}^m

Goal
- Learn an adapted model that is as close as possible to our desired model: f^ad = argmin_{f ∈ F} R_{p^ad(x,y)}(f)

Notes
- Adaptation data D_m may be very limited
- Original data for training f^tr is not preserved for adaptation

Page 7

Scenarios

Customization
- Speech recognition: speaker adaptation
- Handwriting recognition: writer adaptation
- Language processing: domain adaptation

Evolutionary environment
- Spam filter

Page 8

Practical Work on Adaptation

Gaussian mixture models (GMMs)
- MAP (Gauvain 94); MLLR (Leggetter 95)

Support vector machines (SVMs)
- Boosting-like approach (Matic 93)
- Weighted combination of old support vectors and adaptation data (Wu 04)

Multi-layer perceptrons (MLPs)
- Shared "internal representation" in transfer learning (Baxter 95, Caruana 97, Stadermann 05)
- Linear input network (Neto 95)

Conditional maximum entropy models
- Gaussian prior (Chelba 04)

Page 9

This Work Seeks Answers to …

A unified and principled approach to adaptation
- Applicable to a variety of classifiers
- Amenable to variations in the amount of adaptation data

Quantitative relationships between
- The generalization error bound (or sample complexity bound), and
- The divergence between training and target distributions

Page 10

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 11

Bayesian Fidelity Prior

Adaptation objective

min_{f ∈ F}  R_emp(f) − λ ln p_fid(f)

- R_emp(f) – empirical error on the adaptation data
- p_fid(f) – Bayesian "fidelity prior"

Fidelity prior

ln p_fid(f) ≜ E_{p^tr(x,y)}[ ln p(f | x, y) ]

- How likely a classifier is given a training distribution rather than a training set – key difference from Baxter 97
- Reflects the "fidelity" to the unadapted model
- Related to the KL-divergence

Page 12

Generative Models

Generative models
- F – a space of generative models p(x, y | f)

Classification

ŷ = argmax_y log p(x, y | f)

Posterior

p(f | x, y) = p(x, y | f) π(f) / Σ_{f′} p(x, y | f′) π(f′)

where π(f) is the standard prior, chosen before training.

Assume f^tr and f^ad are the true models generating the training and target distributions respectively, i.e.

p(x, y | f^tr) = p^tr(x, y)
p(x, y | f^ad) = p^ad(x, y)

Note that this assumption is justified when the function space contains the true models and if we use the log-likelihood loss.

Page 13

Fidelity Prior for Generative Models

Key result

ln p_fid(f) = −β D( p(x, y | f^tr) ∥ p(x, y | f) ) + ln π(f),  where β > 0

Implication
- Fidelity prior at the desired model:

  ln p_fid(f^ad) = −β D( p^tr ∥ p^ad ) + ln π(f^ad),

  so D( p^tr ∥ p^ad ) → 0  ⟹  p_fid(f^ad) → π(f^ad)

- We are more likely to learn our desired model using the fidelity prior than using the standard prior

Page 14

Instantiations

To compute the fidelity prior
- Assuming a "uniform" standard prior, this prior is determined by the KL-divergence
- We can use an upper bound on the KL-divergence (hence a lower bound on the prior)

Gaussian models
- The fidelity prior is a normal-Wishart distribution

Mixture models
- An upper bound on the KL-divergence (using the log-sum inequality); see the sketch below

Hidden Markov models
- An upper bound on the KL-divergence (Silva 06)
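For the mixture case, the slide only names the tool; a standard matched-pair bound obtained from the log-sum inequality (my reconstruction of the kind of bound meant, not the thesis's exact statement) is:

```latex
% Upper bound on the KL-divergence between two mixtures with weights a, b
% and components p_i, q_i, pairing components index-by-index:
\[
D\!\left(\textstyle\sum_i a_i p_i \,\Big\|\, \sum_i b_i q_i\right)
\;\le\; \underbrace{\textstyle\sum_i a_i \log \tfrac{a_i}{b_i}}_{D(a \,\|\, b)}
\;+\; \textstyle\sum_i a_i\, D(p_i \,\|\, q_i).
\]
% Each component-wise term D(p_i || q_i) has a closed form for Gaussians,
% so the bound (and hence a lower bound on the fidelity prior) is computable.
```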

Page 15

Discriminative Models

A unified view of SVMs, MLPs, CRFs, etc.
- Affine classifiers in a transformed space: f = (w, b)
- Classification: sgn( w^T Φ(x) + b )

        Φ(x)                       Loss function
SVMs    Determined by the kernel   Hinge loss
MLPs    Hidden neurons             Log loss
CRFs    Any feature function       Log loss

Conditional distribution (for the binary case)

p(y | x, f) = 1 / ( 1 + e^{−y (w^T Φ(x) + b)} )

Page 16

Discriminative Models (cont.)

Conditional models
- F – a space of conditional models p(y | x, f)

Classification

ŷ = argmax_y log p(y | x, f)

Posterior

p(f | x, y) = p(y | x, f) π(f) / Σ_{f′} p(y | x, f′) π(f′)

Assume f^tr and f^ad are the true models describing the training and target conditional distributions respectively, i.e.

p(y | x, f^tr) = p^tr(y | x)
p(y | x, f^ad) = p^ad(y | x)

Page 17

Fidelity Prior for Conditional Models

Again a divergence

ln p_fid(f) = −β D( p(y | x, f^tr) ∥ p(y | x, f) ) + ln π(f),  where β > 0

What if we do not know p^tr(x, y)?
- We seek an upper bound on the KL-divergence, and hence a lower bound on the prior

Key result

D( p(y | x, f^tr) ∥ p(y | x, f) ) ≤ R̄ ∥w − w^tr∥ + |b − b^tr|

where R̄ = E[ ∥Φ(x)∥ ].

Page 18

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Learning and adaptation in the Vocal Joystick
- Conclusions and future work

Page 19

Occam's Razor Bound for Adaptation

(McAllester 98) For a countable function space, for any prior π(f):

Pr{ ∀f ∈ F: R(f) ≤ R_emp(f) + sqrt( (−ln π(f) + ln(1/δ)) / (2m) ) } ≥ 1 − δ

Apply for adaptation: substituting the fidelity prior for π(f), the complexity term becomes

sqrt( ( β D( p(x, y | f^tr) ∥ p(x, y | f) ) − ln π(f) + ln(1/δ) ) / (2m) )

Page 20

[Figure: the complexity term Φ(f, m, δ) plotted against sample size m, comparing the bound using the standard prior with bounds using the fidelity prior at increasing values of the KL-divergence]

Φ(f, m, δ) = sqrt( ( β D( p(x, y | f^tr) ∥ p(x, y | f) ) − ln π(f) + ln(1/δ) ) / (2m) )
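To make the plotted comparison concrete, a small sketch (my own illustration; β, −ln π(f), m, and the KL values are made up) that evaluates the complexity term for growing divergence:

```python
import math

def bound_term(beta, kl, neg_log_prior, m, delta):
    """Phi(f, m, delta) = sqrt((beta*KL - ln pi(f) + ln(1/delta)) / (2m))."""
    return math.sqrt((beta * kl + neg_log_prior + math.log(1.0 / delta)) / (2 * m))

m, delta = 1000, 0.05         # adaptation sample size, confidence level
neg_log_pi, beta = 8.0, 1.0   # hypothetical -ln pi(f) and beta

# The bound loosens as the model moves away from the unadapted model,
# matching the "increasing KL" trend in the plot above.
for kl in [0.0, 0.5, 2.0, 8.0]:
    print(f"KL = {kl:4.1f}  ->  Phi = {bound_term(beta, kl, neg_log_pi, m, delta):.4f}")
```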

Page 21

PAC-Bayesian Bounds for Adaptation

(McAllester 98) For both countable and uncountable function spaces, for any prior p(f) and posterior q(f):

Pr{ ∀q: E_{f~q}[R(f)] ≤ E_{f~q}[R_emp(f)] + sqrt( ( D(q(f) ∥ p(f)) + ln m + ln(1/δ) + 2 ) / (2m − 1) ) } ≥ 1 − δ

Choice of prior p(f) and posterior q(f)
- Use p_fid(f) or its related forms as the prior
- Choose q(f) to have the same parametric form

Examples
- Gaussian models: D( q(f) ∥ p(f) ) = (μ′ − μ^tr)^T Σ^{−1} (μ′ − μ^tr) / 2
- Linear classifiers: D( q(f) ∥ p(f) ) = ∥w′ − w^tr∥² / 2
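The linear-classifier example follows from the KL-divergence between two unit-variance Gaussians centered at the new and unadapted weights; a quick check of that identity (a standard result, reconstructed here rather than taken from the slide):

```latex
% KL-divergence between posterior q = N(w', I) and prior p = N(w^tr, I):
\[
D\big(\mathcal{N}(w', I)\,\|\,\mathcal{N}(w^{tr}, I)\big)
= \tfrac{1}{2}\big(\operatorname{tr}(I) + \|w' - w^{tr}\|^{2} - d - \ln\det I\big)
= \tfrac{1}{2}\,\|w' - w^{tr}\|^{2},
\]
% where d is the weight dimension: the trace and log-determinant terms
% cancel because both covariances are the identity matrix.
```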

Page 22

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 23

Algorithms Derived from the Fidelity Prior

Generative models

min_{f ∈ F}  R_emp(f) + λ D( p(x, y | f^tr) ∥ p(x, y | f) )

- Relates to MAP adaptation

Conditional models

min_{f ∈ F}  R_emp(f) + λ D( p(y | x, f^tr) ∥ p(y | x, f) )

In particular, for log-linear models:

min_{f ∈ F}  R_emp(f) + λ₁ ∥w − w^tr∥² + λ₂ |b − b^tr|²

We focus on SVMs and MLPs; a sketch of the log-linear case follows below.
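A minimal sketch of the log-linear instantiation, assuming binary labels in {−1, +1} and logistic loss; the regularizer pulls (w, b) toward the unadapted (w_tr, b_tr) rather than toward zero. Function names, data, and λ values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def adapt_log_linear(X, y, w_tr, b_tr, lam1=1.0, lam2=1.0):
    """min_f R_emp(f) + lam1*||w - w_tr||^2 + lam2*|b - b_tr|^2 (logistic loss)."""
    def objective(theta):
        w, b = theta[:-1], theta[-1]
        margins = y * (X @ w + b)
        log_loss = np.mean(np.logaddexp(0.0, -margins))  # ln(1 + e^{-margin})
        return log_loss + lam1 * np.sum((w - w_tr) ** 2) + lam2 * (b - b_tr) ** 2
    theta0 = np.append(w_tr, b_tr)  # warm-start at the unadapted model
    res = minimize(objective, theta0, method="L-BFGS-B")
    return res.x[:-1], res.x[-1]

# Hypothetical unadapted model and a handful of adaptation samples.
w_tr, b_tr = np.array([1.0, -1.0]), 0.0
X = np.array([[0.5, 0.1], [-0.3, 0.8], [1.2, -0.4], [-0.9, -0.2]])
y = np.array([1, -1, 1, -1])
w_ad, b_ad = adapt_log_linear(X, y, w_tr, b_tr, lam1=0.5, lam2=0.5)
```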

Page 24

Regularized SVM Adaptation

Optimization objective

min_{w,b}  (1/2) ∥w − w^tr∥² + (C/m) Σ_{i=1}^m ξ_i
s.t.  ξ_i ≥ 1 − y_i ( w^T Φ(x_i) + b ),  ξ_i ≥ 0

Globally optimal solution

f(x) = Σ_{i=1}^{|SV^tr|} α_i^tr y_i^tr k(x_i^tr, x) + Σ_{i=1}^{m} α_i y_i k(x_i, x)

- Regularized – fixing old support vectors and their coefficients
- Extended regularized – update coefficients of old support vectors as well, with two separate constraints in the dual space
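A sketch of evaluating the adapted decision function from the expansion above: the old support-vector term is frozen (the regularized variant), while the new coefficients would come from solving the dual on the adaptation data. All names are illustrative:

```python
import numpy as np

def rbf_kernel(a, b, std=10.0):
    """k(a, b) = exp(-||a - b||^2 / (2 std^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * std ** 2))

def adapted_decision(x, sv_tr, y_tr, alpha_tr, sv_ad, y_ad, alpha_ad, b=0.0):
    """f(x) = sum_i alpha_tr_i y_tr_i k(x_tr_i, x) + sum_i alpha_i y_i k(x_i, x).

    The first sum over old support vectors is held fixed (regularized);
    the extended regularized variant would re-estimate alpha_tr as well.
    """
    old_term = sum(a * y * rbf_kernel(sv, x)
                   for a, y, sv in zip(alpha_tr, y_tr, sv_tr))
    new_term = sum(a * y * rbf_kernel(sv, x)
                   for a, y, sv in zip(alpha_ad, y_ad, sv_ad))
    return np.sign(old_term + new_term + b)  # b: bias term, if one is used
```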

Page 25

Algorithms in Comparison

- Unadapted
- Retrained: use adaptation data only
- Boosted (Matic 93): select adaptation data misclassified by the unadapted model; combine with old support vectors
- Bootstrapped (proposed in thesis): train a seed classifier using adaptation data only; select old support vectors correctly classified by the seed classifier; combine with adaptation data (see the sketch after this list)
- Regularized and Extended regularized
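A sketch of the bootstrapped selection step, assuming scikit-learn's SVC as the trainer; the thesis does not prescribe a toolkit, and all names are illustrative (gamma = 1/(2·std²) ≈ 0.005 for the std = 10 RBF kernel used earlier):

```python
import numpy as np
from sklearn.svm import SVC

def bootstrap_adapt(X_ad, y_ad, sv_tr, sv_tr_labels, gamma=0.005):
    """Bootstrapped SVM adaptation:
    1) train a seed classifier on adaptation data only;
    2) keep old support vectors that the seed classifies correctly;
    3) retrain on the adaptation data plus the kept support vectors."""
    seed = SVC(kernel="rbf", gamma=gamma).fit(X_ad, y_ad)
    keep = seed.predict(sv_tr) == sv_tr_labels
    X_comb = np.vstack([X_ad, sv_tr[keep]])
    y_comb = np.concatenate([y_ad, sv_tr_labels[keep]])
    return SVC(kernel="rbf", gamma=gamma).fit(X_comb, y_comb)
```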

Page 26

Regularized MLP Adaptation

Optimization objective for a multi-class, two-layer MLP

min_{W_h2o, W_i2h}  (λ₁/2) ∥W_h2o − W_h2o^tr∥² + (λ₂/2) ∥W_i2h − W_i2h^tr∥² + R_emp(f)

- W_h2o and W_i2h – the hidden-to-output and input-to-hidden weight matrices, respectively
- ∥W∥² = tr(W^T W)
- R_emp(f) – cross-entropy, corresponding to logistic loss

Locally optimal solution found using back-propagation (see the sketch below).
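A minimal numpy sketch of back-propagation under this objective, assuming tanh hidden units, softmax outputs, one-hot targets, and no bias terms; the hyperparameters and these simplifications are mine, not the thesis's:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adapt_mlp(X, Y, W_i2h_tr, W_h2o_tr, lam1=0.1, lam2=0.1, lr=0.05, epochs=200):
    """Gradient descent on (lam1/2)||W_h2o - W_h2o_tr||^2
    + (lam2/2)||W_i2h - W_i2h_tr||^2 + cross-entropy.
    X: (m, d) inputs; Y: (m, K) one-hot targets."""
    W_i2h, W_h2o = W_i2h_tr.copy(), W_h2o_tr.copy()  # warm-start at unadapted
    m = X.shape[0]
    for _ in range(epochs):
        H = np.tanh(X @ W_i2h)                 # hidden activations (m, h)
        P = softmax(H @ W_h2o)                 # class posteriors (m, K)
        dZ = (P - Y) / m                       # grad of cross-entropy wrt logits
        g_h2o = H.T @ dZ + lam1 * (W_h2o - W_h2o_tr)
        dH = (dZ @ W_h2o.T) * (1.0 - H ** 2)   # back-prop through tanh
        g_i2h = X.T @ dH + lam2 * (W_i2h - W_i2h_tr)
        W_h2o -= lr * g_h2o
        W_i2h -= lr * g_i2h
    return W_i2h, W_h2o
```

Setting λ₁ = λ₂ = 0 recovers plain retraining from the unadapted weights, one sense in which the comparison algorithms on the next slide are special cases of the regularized form.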

Page 27

Algorithms in Comparison

- Unadapted
- Retrained: start from randomly initialized weights and train with weight decay
- Linear input network (Neto 95): add a linear transformation in the input space
- Retrained speaker-independent (Neto 95): start from unadapted; train both layers
- Retrained last layer (Baxter 95, Caruana 97, Stadermann 05): start from unadapted; only train the last layer
- Retrained first layer (proposed in thesis): start from unadapted; only train the first layer
- Regularized

Many of the algorithms above can be considered special cases of regularized adaptation.

Page 28

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 29

Experimental Paradigm

Procedure
- Train an unadapted model on the training set
- Adapt (with supervision) and evaluate via n-fold CV on the test set
- Select regularization coefficients on the dev set

Corpora
- VJ vowel dataset (Kilanski 06)
- NORB image dataset (LeCun 04)

Page 30

VJ Vowel Dataset

Task
- 8 vowel classes
- Frame-level classification error rate
- Speaker adaptation

Data allocation
- Training set – 21 speakers, 420K samples (for SVM, we randomly selected 80K samples for training)
- Test set – 10 speakers, 200 samples
- Dev set – 4 speakers, 80 samples

Features
- 182 dimensions – 7 frames of MFCC + delta features

Page 31

SVM Adaptation

- RBF kernel (std = 10) optimized for training and fixed for adaptation
- Mean and std. dev. of frame-level error rates (%) over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       38.21 ± 4.68    38.21 ± 4.68    38.21 ± 4.68
Retrained                       24.70 ± 2.47    18.94 ± 1.52    14.00 ± 1.40
Boosted                         29.66 ± 4.60    26.54 ± 2.49    28.85 ± 2.03
Bootstrapped                    26.16 ± 5.07    19.24 ± 1.32    14.41 ± 1.26
Regularized                     23.28 ± 4.21    19.01 ± 1.30    15.00 ± 1.41
Ext. regularized                28.55 ± 4.99    25.38 ± 2.49    20.36 ± 2.08

(Red entries in the original slide mark the best result and those not significantly different from the best at the p < 0.001 level.)

Page 32

MLP Adaptation (I)

- 50 hidden nodes
- Mean and std. dev. over 10 speakers

# adapt. samples per speaker    1K              2K              3K
Unadapted                       32.03 ± 3.76    32.03 ± 3.76    32.03 ± 3.76
Retrained (reg.)                14.21 ± 2.50    10.06 ± 3.15    9.09 ± 3.92
Linear input                    13.52 ± 2.22    11.81 ± 2.33    11.96 ± 1.77
Retrained SI                    12.15 ± 2.70    9.64 ± 2.74     7.88 ± 2.49
Retrained last                  15.45 ± 2.75    13.32 ± 2.46    11.40 ± 2.37
Retrained first                 11.56 ± 2.09    9.12 ± 3.08     7.35 ± 2.26
Regularized                     11.56 ± 2.09    8.16 ± 2.60     7.30 ± 2.40

Page 33

MLP Adaptation (II)

Varying the number of vowel classes available in the adaptation data.

[Figure: error rate (%) on all vowel classes, 0–50, vs. number of vowel classes used for adaptation, 1–8, comparing linear input net, retrained SI, retrained last, retrained first, and regularized]

Page 34

NORB Image Dataset

Task
- 5 object classes
- Classify each image
- Lighting condition adaptation

Data allocation
- Training set – 2700 samples
- Test set – 2700 samples

Features
- 32x32 raw images

Page 35

SVM Adaptation

- RBF kernel (std = 500) optimized for training and fixed for adaptation
- Mean and std. dev. of classification error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition    90            180
Unadapted                                  12.5 ± 2.7    12.5 ± 2.7
Retrained                                  30.1 ± 3.5    18.9 ± 2.5
Boosted                                    11.5 ± 2.0    10.7 ± 1.5
Bootstrapped                               21.9 ± 4.8    14.9 ± 2.5
Regularized                                14.8 ± 1.9    13.4 ± 1.9
Ext. regularized                           11.0 ± 1.7    10.4 ± 1.9

(Red entries in the original slide mark the best result and those not significantly different from the best at the p < 0.001 level.)

Page 36

MLP Adaptation

- 30 hidden nodes
- Mean and std. dev. over 6 lighting conditions

# adapt. samples per lighting condition    90             180
Unadapted                                  21.4 ± 6.6     21.4 ± 6.6
Retrained (reg.)                           41.9 ± 15.8    32.5 ± 12.8
Retrained SI                               19.6 ± 4.3     18.6 ± 4.3
Regularized                                19.2 ± 3.4     18.0 ± 3.7

Page 37

Roadmap

- Introduction
- Theoretical results
  - A Bayesian fidelity prior for adaptation
  - Generalization error bounds
- Regularized adaptation algorithms
  - SVM and MLP adaptation
  - Experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work

Page 38

Why the Vocal Joystick

Hands-free computer interfaces
- Eye-gaze tracking – high cost
- Head-mouse or chin-joystick – low bandwidth
- Speech interfaces – inability to do continuous control

The Vocal Joystick (Bilmes 05)
- Can control both continuous and discrete tasks
- Utilizes intensity, vowel quality, pitch, and discrete sound identity
- The VJ-mouse control scheme

Joint work with
- Graduate students: J. Malkin, S. Harada and K. Kilanski
- Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck

Page 39

Page 40

The Pattern Recognition Module

[Diagram: input features (intensity, zero-crossing, autocovariance, MFCCs) pass through an adaptive filter; pitch tracking (a graphical model) outputs pitch; vowel detection and vowel classification (a two-layer MLP, with regularized MLP adaptation) output vowel posteriors; discrete sound detection and recognition (phoneme HMMs, with regularized GMM adaptation) output the discrete sound ID]

Page 41

Vowel Classifier Adaptation

Classifier
- Two-layer MLP with 4 or 8 outputs
- Trained on the VJ-vowel dataset

Adaptation
- Motivation: overlap of different vowel classes articulated by different speakers
- Regularized adaptation for MLPs
- Improvements over the unadapted classifier (with 2 secs of adaptation data per vowel): 4-class: 7.6% → 0.2%; 8-class: 32.0% → 8.2%

Page 42

Discrete Sound Recognizer Adaptation

Recognizer
- Phoneme HMMs
- Trained on TIMIT

Rejection
- "In-vocabulary": unvoiced consonants
- Garbage utterances: vowels, breathing, extraneous speech
- Heuristics for rejection: duration, zero-crossing, pitch, posteriors

Adaptation
- Motivation: consonant articulation mismatch between TIMIT data and the VJ application
- Regularized adaptation for GMMs
- Obtains better rejection thresholds

Page 43

Conclusions and Future Work

Key contributions
- A Bayesian fidelity prior and generalization error bounds
- Regularized adaptation algorithms derived from the fidelity prior
- Application to the Vocal Joystick

Future work
- Theoretical: adaptation error bounds for classifiers in an uncountable function space
- Algorithmic: kernel adaptation
- The Vocal Joystick: discrete sound data collection and training a better baseline

Page 44

Thank You