Lecture 8 - University of California, Berkeley

EE 225D, Spring 1999: Lecture on Pattern Recognition

University of California, Berkeley
College of Engineering
Department of Electrical Engineering and Computer Sciences

Professors: N. Morgan / B. Gold

Pattern Classification

Lecture 8

Speech Pattern Recognition

• Soft pattern classification plus temporal sequence integration
• Supervised pattern classification: class labels used in training
• Unsupervised pattern classification: class labels not available or used

[Block diagram: Pattern → Feature Extraction → Feature Vector $(x_1, x_2, \ldots, x_d)$ → Classification → class $\omega_k$, $1 \le k \le K$]

• Training: learning parameters of classifier
• Testing: classify independent test set, compare with labels and score

Feature Extraction Criteria

• Class discrimination
• Generalization
• Parsimony (efficiency)

[Figure: plosive + vowel energies for 2 different gains; energy contours $E(t)$ and $CE(t)$ plotted against $t$]

$$\frac{\partial}{\partial t}\log CE(t) = \frac{\partial}{\partial t}\bigl(\log C + \log E(t)\bigr) = \frac{\partial}{\partial t}\log E(t)$$
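The gain invariance of the log-energy derivative can be checked numerically. The sketch below is only an illustration: it assumes a sampled energy contour, uses a first difference as a stand-in for the time derivative, and the energy values, the gain of 10 and the function name `delta_log_energy` are all made up.

```python
import math


def delta_log_energy(energy):
    """First difference of log energy, a discrete stand-in for d/dt log E(t)."""
    logs = [math.log(e) for e in energy]
    return [b - a for a, b in zip(logs, logs[1:])]


energy = [0.5, 1.0, 4.0, 8.0, 2.0]    # made-up energy contour E(t)
gain = 10.0                            # constant gain C (assumed value)
scaled = [gain * e for e in energy]    # the same contour recorded as C * E(t)

d_plain = delta_log_energy(energy)
d_scaled = delta_log_energy(scaled)

# Largest disagreement between the two feature sequences (rounding error only).
max_diff = max(abs(a - b) for a, b in zip(d_plain, d_scaled))
```

Because log C is constant in time, it cancels in every difference, so the two feature sequences agree up to floating-point rounding.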

Feature Vector Size

• Best representations for discrimination on training set are large (highly dimensioned)
• Best representations for generalization to test set are (typically) succinct

Dimensionality Reduction

• Principal components (i.e., SVD, KL transform, eigenanalysis ...)
• Linear Discriminant Analysis (LDA)
• Application-specific knowledge
• Feature Selection via PR Evaluation

[Figure: scatter plot of two classes, marked x and o, in the feature plane with axes $f_1$ and $f_2$]

PR Methods

• Minimum Distance
• Discriminant Functions
• Linear Discriminant
• Nonlinear Discriminant (e.g., quadratic, neural networks)
• Statistical Discriminant Functions

Minimum Distance

• Vector or matrix representing element
• Define a distance function
• Choose the class of stored element closest to new input
• Choice of distance equivalent to implicit statistical assumptions
• For speech, temporal variability complicates this

$z_i$ = template vector (prototype)

$x$ = input vector

Choose $i$ to minimize distance:

$$\arg\min_i\,(x - z_i)^T (x - z_i) = \arg\min_i\,\bigl(x^T x + z_i^T z_i - 2 x^T z_i\bigr)$$

$$= \arg\max_i\,\frac{z_i^T z_i - 2 x^T z_i}{-2} = \arg\max_i\,\Bigl(x^T z_i - \tfrac{1}{2} z_i^T z_i\Bigr)$$

If $z_i^T z_i = 1$ for all $i$ $\Rightarrow$ $\arg\max_i\,(x^T z_i)$
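The equivalence between minimizing squared distance and maximizing the inner-product score can be sketched in a few lines. The templates, the input vector and the function names below are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def dot(a, b):
    """Inner product a^T b of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def nearest_by_distance(x, templates):
    """Index i minimizing (x - z_i)^T (x - z_i)."""
    def sq_dist(z):
        diff = [xi - zi for xi, zi in zip(x, z)]
        return dot(diff, diff)
    return min(range(len(templates)), key=lambda i: sq_dist(templates[i]))


def nearest_by_score(x, templates):
    """Index i maximizing x^T z_i - 0.5 * z_i^T z_i."""
    def score(z):
        return dot(x, z) - 0.5 * dot(z, z)
    return max(range(len(templates)), key=lambda i: score(templates[i]))


templates = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]   # toy prototypes z_i
x = [0.5, 1.5]                                      # toy input vector

i_dist = nearest_by_distance(x, templates)
i_score = nearest_by_score(x, templates)
```

Here the squared distances are 2.5, 0.5 and 8.5 and the scores are 0, 1 and -3, so both rules select the second template (index 1). The term $x^T x$ never needs to be computed because it is the same for every $i$.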

Problems with Min Distance

• Proper scaling of dimensions (size, discrimination)
• For high dim, sparsely sampled space
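The scaling problem is easy to reproduce: with an unweighted Euclidean distance, re-expressing one feature in different units can change which template is nearest. The following is a minimal sketch with made-up prototypes and an assumed scale factor of 10 on the second dimension; none of the values come from the lecture.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(x, templates):
    """Index of the template closest to x."""
    return min(range(len(templates)), key=lambda i: sq_dist(x, templates[i]))


def rescale(v, factors):
    """Multiply each dimension by its own factor (a change of units)."""
    return [vi * f for vi, f in zip(v, factors)]


templates = [[1.0, 0.0], [0.0, 0.3]]   # toy prototypes
x = [0.4, 0.0]                         # toy input

before = nearest(x, templates)

factors = [1.0, 10.0]                  # second feature expressed in smaller units
after = nearest(rescale(x, factors), [rescale(z, factors) for z in templates])
```

Before rescaling the second template wins (squared distances 0.36 versus 0.25); after rescaling the first one wins (0.36 versus 9.16), although the data are physically unchanged.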

Decision Rule for Min Distance

• Nearest Neighbor (NN) - in the limit of infinite samples, at most twice the error of optimum classifier
• k-Nearest Neighbor (kNN)
• Lots of storage for large problems; potentially large searches
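A brute-force k-nearest-neighbor rule can be sketched as follows. This is an illustration under simple assumptions (squared Euclidean distance, majority vote, no tie-breaking policy beyond what `Counter` provides); the stored vectors, labels and the name `knn_classify` are hypothetical.

```python
from collections import Counter


def knn_classify(x, train_vectors, train_labels, k=1):
    """Majority vote among the k stored vectors closest to x."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(x, train_vectors[i]))
    # Every query scans and sorts all stored vectors: the storage and
    # search cost grows with the size of the training set.
    order = sorted(range(len(train_vectors)), key=sq_dist)
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_vectors = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [1.0, 1.0], [0.9, 1.2], [0.45, 0.45]]
train_labels = ["o", "o", "o", "x", "x", "x"]

query = [0.4, 0.4]
label_1nn = knn_classify(query, train_vectors, train_labels, k=1)
label_3nn = knn_classify(query, train_vectors, train_labels, k=3)
```

With this toy data the single nearest neighbor is an isolated "x", so 1-NN answers "x", while the three nearest neighbors contain two "o" points, so 3-NN answers "o".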

Some Opinions

• Better to throw away bad data than to reduce its weight
• Dimensionality-reduction based on variance often a bad choice for supervised pattern recognition
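The opinion about variance-based reduction can be illustrated with a toy case in which the high-variance feature carries no class information. The data below are invented for this purpose, and comparing class means is only one crude, assumed measure of discrimination.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance of a list of numbers."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Toy two-class data as (f1, f2) pairs: f1 has a large spread but the same
# distribution in both classes; f2 has a small spread but separates them.
class_x = [(-10.0, 1.0), (0.0, 1.1), (10.0, 0.9)]
class_o = [(-10.0, -1.0), (0.0, -1.1), (10.0, -0.9)]
all_points = class_x + class_o

var_f1 = variance([p[0] for p in all_points])
var_f2 = variance([p[1] for p in all_points])

# Distance between the class means along each feature.
gap_f1 = abs(mean([p[0] for p in class_x]) - mean([p[0] for p in class_o]))
gap_f2 = abs(mean([p[1] for p in class_x]) - mean([p[1] for p in class_o]))
```

A reduction that keeps only the highest-variance direction would keep f1 and discard f2, even though the class means coincide along f1 and differ by about 2 along f2.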

Discriminant Analysis

• Discriminant functions max for correct class, min for others
• Decision surface between classes
• Linear decision surface for 2-dim is line, for 3 is plane; generally called hyperplane
• For 2 classes, surface at $\omega^T x + \omega_0 = 0$
• 2-class quadratic case, surface at $x^T W x + \omega^T x + \omega_0 = 0$
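The two decision surfaces can be evaluated directly. In the sketch below the sign of the discriminant picks the side of the surface; the parameter values (the line x1 + x2 = 1 and the unit circle) and the function names are hypothetical examples, not taken from the lecture.

```python
def linear_discriminant(x, w, w0):
    """g(x) = w^T x + w0; the decision surface is g(x) = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def quadratic_discriminant(x, W, w, w0):
    """g(x) = x^T W x + w^T x + w0; the decision surface is g(x) = 0."""
    n = len(x)
    quad = sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))
    return quad + linear_discriminant(x, w, w0)


# Assumed linear case: the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
g_on_line = linear_discriminant([0.5, 0.5], w, w0)    # point on the surface
g_far_side = linear_discriminant([2.0, 2.0], w, w0)   # point on the positive side

# Assumed quadratic case: the unit circle x1^2 + x2^2 = 1.
W_circle = [[1.0, 0.0], [0.0, 1.0]]
w_zero, w0_circle = [0.0, 0.0], -1.0
g_inside = quadratic_discriminant([0.0, 0.0], W_circle, w_zero, w0_circle)
g_outside = quadratic_discriminant([2.0, 0.0], W_circle, w_zero, w0_circle)
```

The linear discriminant is zero on the line and positive on one side of it; the quadratic one is negative inside the circle and positive outside, which a single hyperplane cannot express.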

Training Discriminant Functions

• Minimum distance
• Fisher linear discriminant
• Gradient learning

Generalized Discriminators - ANNs

• McCulloch Pitts neural model
• Rosenblatt Perceptron
• Multilayer Systems

The Perceptron

McCulloch-Pitts Neuron - Rosenblatt Perceptron

[Figure: inputs $x_1, x_2, \ldots, x_d$ with weights $w_1, w_2, \ldots, w_d$ and a bias feed a summing node (+) that produces the output $y_o$]

Perceptron Convergence

If classes are linearly separable, the following rule will converge in a finite number of steps:

For each pattern $x$ at time step $k$:

if

$$x(k) \in \text{class 1},\ \omega^T(k)\,x(k) \le 0 \ \Rightarrow\ \omega(k+1) = \omega(k) + c\,x(k)$$

$$x(k) \in \text{class 2},\ \omega^T(k)\,x(k) \ge 0 \ \Rightarrow\ \omega(k+1) = \omega(k) - c\,x(k)$$

else

$$\omega(k+1) = \omega(k)$$
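The update rule can be sketched as a short training loop. Several choices here are assumptions made for illustration rather than part of the rule itself: each pattern is augmented with a constant 1 so the bias is learned as an extra weight, the weights start at zero, c = 1, training stops after one full pass without updates, and the four training patterns are made up.

```python
def train_perceptron(patterns, labels, c=1.0, max_epochs=100):
    """Perceptron rule for class labels 1 and 2.

    Returns (weights, converged). The last weight acts as the bias because
    every pattern is augmented with a constant 1 (an assumed convention).
    """
    w = [0.0] * (len(patterns[0]) + 1)
    for _ in range(max_epochs):
        updated = False
        for x, label in zip(patterns, labels):
            xa = list(x) + [1.0]
            activation = sum(wi * xi for wi, xi in zip(w, xa))
            if label == 1 and activation <= 0:
                w = [wi + c * xi for wi, xi in zip(w, xa)]   # move toward x
                updated = True
            elif label == 2 and activation >= 0:
                w = [wi - c * xi for wi, xi in zip(w, xa)]   # move away from x
                updated = True
        if not updated:          # a full pass with no corrections
            return w, True
    return w, False


def classify(w, x):
    """Class 1 if the weighted sum is positive, else class 2."""
    xa = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 2


patterns = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]]   # toy data
labels = [1, 1, 2, 2]

weights, converged = train_perceptron(patterns, labels)
predictions = [classify(weights, x) for x in patterns]
```

On this linearly separable toy set a single correction is enough, and the second pass makes no updates. If the classes were not separable, the loop would simply stop at `max_epochs` with `converged` set to `False`.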

Multilayer Perceptrons:

• Heterogeneous, “hard” nonlinearity (DAID, 1961)
• Homogeneous, “soft” nonlinearity (“modern” MLP)

[Figure labels: feature subsets, Gaus. class, Perceptron, I/O]

[Figure: plot of $f(y)$ against $y$]

$$f(y) = \frac{1}{1 + e^{-y}} \quad \text{(sigmoid)}$$

$$0 < f(y) < 1$$
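A small numerical sketch of the sigmoid follows. Splitting the computation by the sign of y is an implementation choice assumed here so that `math.exp` is never called with a large positive argument; the sample points are arbitrary.

```python
import math


def sigmoid(y):
    """f(y) = 1 / (1 + exp(-y)), evaluated so that exp never overflows."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)              # y < 0, so e lies in [0, 1)
    return e / (1.0 + e)


values = [sigmoid(y) for y in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```

The outputs increase monotonically, pass through 0.5 at y = 0, and satisfy f(-y) = 1 - f(y). In floating point, very large |y| rounds to exactly 0 or 1, even though the mathematical function stays strictly between them.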

Some PR Issues

• Testing on the training set
• Training on the test set
• No. parameters vs no. training examples: overfitting and overtraining
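Why testing on the training set is misleading can be shown with a nearest-neighbor classifier, which memorizes its training data. The one-dimensional data below are invented, including one deliberately mislabeled training point at 0.5; the function names are hypothetical.

```python
def nn_label(x, train_vectors, train_labels):
    """Label of the stored vector closest to x (1-nearest-neighbor)."""
    best = min(range(len(train_vectors)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_vectors[i])))
    return train_labels[best]


def accuracy(vectors, labels, train_vectors, train_labels):
    """Fraction of vectors whose predicted label matches the reference label."""
    hits = sum(nn_label(x, train_vectors, train_labels) == y
               for x, y in zip(vectors, labels))
    return hits / len(labels)


# Toy training set; the "x" at 0.5 is a deliberately noisy label.
train_vectors = [[0.0], [0.5], [1.0], [2.0], [3.0]]
train_labels = ["o", "x", "o", "x", "x"]

# Independent toy test set drawn from the same two regions.
test_vectors = [[0.4], [0.6], [0.9], [2.5]]
test_labels = ["o", "o", "o", "x"]

train_acc = accuracy(train_vectors, train_labels, train_vectors, train_labels)
test_acc = accuracy(test_vectors, test_labels, train_vectors, train_labels)
```

Every training point is its own nearest neighbor, so the training-set score is a perfect 1.0, while the independent test set scores 0.5 because the memorized noisy label misleads two of the four test points.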
