
Linear Discriminant Functions


Linear discriminant functions and decision surfaces

• Definition: it is a function that is a linear combination of the components of x

g(x) = wᵗx + w0   (1)

where w is the weight vector and w0 the bias

• A two-category classifier with a discriminant function of the form (1) uses the following rule:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
⇔ Decide ω1 if wᵗx > −w0 and ω2 otherwise
If g(x) = 0, x may be assigned to either class


– The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω1 from points assigned to the category ω2

– When g(x) is linear, the decision surface is a hyperplane

– Algebraic measure of the distance from x to the hyperplane (interesting result!)


– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface

– The orientation of the surface is determined by the normal vector w and the location of the surface is determined by the bias

Write x as its projection x_p onto the hyperplane H plus a step of signed length r along the normal:

$$\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$$

Since $g(\mathbf{x}_p) = 0$ and $\mathbf{w}^t\mathbf{w} = \|\mathbf{w}\|^2$:

$$g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0 = \mathbf{w}^t\!\left(\mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0 = g(\mathbf{x}_p) + r\,\|\mathbf{w}\| \quad\Rightarrow\quad r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$$

In particular, $d(\mathbf{0}, H) = \dfrac{w_0}{\|\mathbf{w}\|}$.

[Figure: a point x, its projection x_p onto the hyperplane H, the signed distance r, and the normal vector w]
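As a small illustration of this result (a sketch, not from the slides; the weight vector, bias, and point below are arbitrary):

```python
import numpy as np

# Hypothetical weight vector and bias for g(x) = w^t x + w0
w = np.array([3.0, 4.0])
w0 = -5.0

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
r = signed_distance(x, w, w0)           # positive side of H -> decide omega_1
d_origin = abs(w0) / np.linalg.norm(w)  # distance of the origin from H
print(r, d_origin)
```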


– The multi-category case

• We define c linear discriminant functions

gi(x) = wiᵗx + wi0,   i = 1, …, c

and assign x to ωi if gi(x) > gj(x) ∀ j ≠ i; in case of ties, the classification is undefined

• In this case, the classifier is a “linear machine”
• A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in the region Ri

• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:

gi(x) = gj(x) ⇔ (wi − wj)ᵗx + (wi0 − wj0) = 0

• wi – wj is normal to Hij and

$$d(\mathbf{x}, H_{ij}) = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$$
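A minimal sketch of a linear machine (illustrative only; the three weight vectors and biases below are made up):

```python
import numpy as np

# Hypothetical weights for c = 3 linear discriminants g_i(x) = W[i] @ x + w0[i]
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])

def classify(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    g = W @ x + w0
    return int(np.argmax(g))   # note: ties are broken arbitrarily here

print(classify(np.array([2.0, 1.0]), W, w0))
```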


Generalized Linear Discriminant Functions

• Decision boundaries which separate between classes may not always be linear

• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces

• A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function as:

g(x) = w1f1(x) + w2f2(x) + … + wNfN(x) + wN+1 (1)

where fi(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ Rⁿ (Euclidean space)


• Introducing fN+1(x) = 1 we get:

• This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N + 1)-dimensional space (N + 1 > n)

• g(x) maintains its non-linearity characteristics in Rn

$$g(\mathbf{x}) = \sum_{i=1}^{N+1} w_i f_i(\mathbf{x}) = \mathbf{w}^T \mathbf{y}$$

where $\mathbf{w} = (w_1, w_2, \ldots, w_N, w_{N+1})^T$ and $\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_N(\mathbf{x}), f_{N+1}(\mathbf{x}))^T$


• The most commonly used generalized decision function is g(x) for which fi(x) (1 ≤ i ≤ N) are polynomials

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, x_i x_j + \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N} w_{ijk}\, x_i x_j x_k + \cdots$$

• Quadratic decision functions for a 2-dimensional feature space:

$$g(\mathbf{x}) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$$

here $\mathbf{w} = (w_1, w_2, \ldots, w_6)^T$ and $\mathbf{y} = (x_1^2,\, x_1 x_2,\, x_2^2,\, x_1,\, x_2,\, 1)^T$
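As an illustration (a sketch, not part of the slides; the weights are arbitrary), the quadratic discriminant above is just a dot product in the mapped space:

```python
import numpy as np

def quad_features(x1, x2):
    """Map a 2-D pattern to y = (x1^2, x1*x2, x2^2, x1, x2, 1)."""
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

# Hypothetical weight vector w = (w1, ..., w6)
w = np.array([1.0, -2.0, 1.0, 0.5, 0.5, -3.0])

x1, x2 = 2.0, 1.0
g = w @ quad_features(x1, x2)   # g(x) = w^T y, linear in the mapped space
print(g)
```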


• For patterns x ∈ Rⁿ, the most general quadratic decision function is given by:

$$g(\mathbf{x}) = \sum_{i=1}^{n} w_{ii}\, x_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1} \qquad (2)$$

• The number of terms on the right-hand side is:

$$l = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$$

This is the total number of weights, which are the free parameters of the problem:
– n = 3 ⇒ 10-dimensional
– n = 10 ⇒ 66-dimensional


• In the case of polynomial decision functions of order m, a typical fi(x) is given by:

– It is a polynomial with a degree between 0 and m. To avoid repetitions, i1 ≤ i2 ≤ … ≤ im

where gᵐ(x) is the most general polynomial decision function of order m

$$f_i(\mathbf{x}) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}, \qquad \text{where } 1 \le i_1, i_2, \ldots, i_m \le n \text{ and } e_i,\ 1 \le i \le m, \text{ is 0 or 1}$$

$$g^m(\mathbf{x}) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \cdots i_m}\, x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(\mathbf{x})$$


Example 1: Let n = 3 and m = 2. Then:

$$g^2(\mathbf{x}) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2}\, x_{i_1} x_{i_2} + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$$
$$= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$$

Example 2: Let n = 2 and m = 3. Then:

$$g^3(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3}\, x_{i_1} x_{i_2} x_{i_3} + g^2(\mathbf{x})
= w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(\mathbf{x})$$

where

$$g^2(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2}\, x_{i_1} x_{i_2} + g^1(\mathbf{x})
= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$$
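The expansions above can be generated mechanically. A sketch (not from the slides) that enumerates the monomials with i1 ≤ i2 ≤ … and checks the (n+1)(n+2)/2 count for the quadratic case:

```python
from itertools import combinations_with_replacement
import numpy as np

def poly_features(x, m):
    """All monomials of degree <= m in the components of x (including the constant 1).

    Indices are taken with i1 <= i2 <= ... to avoid repetitions, mirroring the
    ordering used in the sums above.
    """
    n = len(x)
    feats = [1.0]                                   # degree-0 term
    for degree in range(1, m + 1):
        for idx in combinations_with_replacement(range(n), degree):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([2.0, -1.0, 0.5])                      # n = 3
y = poly_features(x, m=2)
print(len(y))                                       # 10 = (n+1)(n+2)/2 for n = 3
```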


• Augmentation

• Incorporate labels into data

• Margin


Learning Linear Classifiers


Basic Ideas

Directly “fit” a linear boundary in feature space.

• 1) Define an error function for classification
  – Number of errors?
  – Total distance of mislabeled points from boundary?
  – Least squares error
  – Within/between class scatter

• 2) Optimize the error function on training data
  – Gradient descent
  – Modified gradient descent
  – Various search procedures
  – How much data is needed?


Two-category case

• Given x1, x2, …, xn sample points, with true category labels α1, α2, …, αn

• Decisions are made according to:

• Now these decisions are wrong when aᵗxi is negative and xi belongs to class ω1. Let yi = αi xi; then aᵗyi > 0 when xi is correctly labelled, and negative otherwise.

$$\text{if } \mathbf{a}^t\mathbf{x}_i > 0,\ \text{class } \omega_1 \text{ is chosen}; \qquad \text{if } \mathbf{a}^t\mathbf{x}_i < 0,\ \text{class } \omega_2 \text{ is chosen}$$

$$\alpha_i = \begin{cases} +1 & \text{if point } \mathbf{x}_i \text{ is from class } \omega_1 \\ -1 & \text{if point } \mathbf{x}_i \text{ is from class } \omega_2 \end{cases}$$
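A small sketch of the augmentation and label-incorporation steps listed earlier (toy data; the augmentation with a leading 1 is an assumption tying the bias into a):

```python
import numpy as np

# Toy data: rows are patterns x_i; labels are +1 for class omega_1, -1 for omega_2
X = np.array([[ 1.0,  2.0],
              [ 2.0,  0.5],
              [-1.0, -1.5],
              [-2.0, -0.5]])
alpha = np.array([1, 1, -1, -1])

# Augment each pattern with a leading 1 so the bias is absorbed into a,
# then multiply by the label: y_i = alpha_i * [1, x_i]
Y = alpha[:, None] * np.hstack([np.ones((len(X), 1)), X])

# A separating vector a satisfies a^t y_i > 0 for all i
a = np.array([0.0, 1.0, 1.0])     # hypothetical candidate
print(Y @ a)                      # all entries positive -> a separates the data
```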


• Separating vector (solution vector)
  – Find a such that aᵗyi > 0 for all i

• aᵗyi = 0 defines a hyperplane with a as its normal

[Figure: sample points and a separating hyperplane in the (x1, x2) plane]


• Problem: the solution is not unique
• Add additional constraints: a margin, requiring aᵗyi > b (a distance away from the boundary)


Margins in data space

[Figure: a margin of width b around the decision boundary in data space]

Larger margins promote uniqueness for underconstrained problems


Finding a solution vector by minimizing an error criterion

What error function? How do we weight the points: all the points or only the error points?
Only error points:

Perceptron Criterion

$$J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} (-\mathbf{a}^t \mathbf{y}_i), \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$$

Perceptron Criterion with Margin

$$J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} (-\mathbf{a}^t \mathbf{y}_i), \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < b\}$$
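A direct sketch of this criterion (illustrative; it reuses the toy label-normalized samples from the earlier sketch):

```python
import numpy as np

def perceptron_criterion(a, Y_data, b=0.0):
    """J_P(a): sum of -a^t y_i over the y_i with a^t y_i < b (b = 0 gives the plain criterion)."""
    scores = Y_data @ a
    bad = scores < b                     # misclassified (or inside-margin) samples
    return -np.sum(scores[bad]), bad

Y_data = np.array([[ 1.0, 1.0, 2.0],
                   [ 1.0, 2.0, 0.5],
                   [-1.0, 1.0, 1.5],
                   [-1.0, 2.0, 0.5]])
a = np.array([0.0, -1.0, 0.0])           # a poor candidate: every sample is misclassified
J, misclassified = perceptron_criterion(a, Y_data)
print(J, misclassified)
```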


Minimizing Error via Gradient Descent


• The min (max) problem: $\min_x f(x)$
• But we learned in calculus how to solve that kind of question!


Motivation

• Not exactly.
• Functions: $f : \mathbb{R}^n \to \mathbb{R}$
• High-order polynomials: $x - \tfrac{1}{6}x^3 + \tfrac{1}{120}x^5 - \tfrac{1}{5040}x^7$
• What about functions that don’t have an analytic representation: a “black box”?


Example of such a function (shown as a plotted surface):

$$f(x, y) = \cos\!\left(\tfrac{1}{2}x\right)\cos\!\left(\tfrac{1}{2}y\right) x$$

Directional Derivatives: first, the one-dimensional derivative:

Directional Derivatives: Along the Axes…

$$\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}$$

Directional Derivatives: In a general direction…

$$\frac{\partial f(x,y)}{\partial v}, \qquad v \in \mathbb{R}^2, \quad \|v\| = 1$$

Directional Derivatives

[Figure: surface plot illustrating the directional derivatives ∂f(x,y)/∂x and ∂f(x,y)/∂y]

The Gradient: Definition in the plane ($f : \mathbb{R}^2 \to \mathbb{R}$)

$$\nabla f(x, y) := \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$$

!!"

#$$%

&

'

'

'

'=(

n

nx

f

x

fxxf ,...,:),...,(

1

1

RRf n!:

The Gradient: Definition

37
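A finite-difference sketch of the gradient (not from the slides), handy for the “black box” functions mentioned earlier; it also evaluates a directional derivative as ⟨∇f, v⟩:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^n -> R at x by central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Example with a simple smooth function (chosen arbitrarily for illustration)
f = lambda x: x[0]**2 + 3 * x[1]**2
g = numerical_gradient(f, [1.0, 2.0])   # approximately (2, 12)

# Directional derivative along a unit vector v: <grad f, v>
v = np.array([1.0, 0.0])
print(g, g @ v)
```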

The Gradient Properties
• The gradient defines a (hyper)plane approximating the function infinitesimally:

$$\Delta z = \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y$$

The Gradient Properties
• Computing directional derivatives: we want the rate of change along v at a point p
• By the chain rule (important for later use):

$$\frac{\partial f}{\partial v}(p) = \langle (\nabla f)_p,\, v \rangle, \qquad \|v\| = 1$$

The Gradient Properties

$$\frac{\partial f}{\partial v}\bigg|_p \text{ is maximal choosing } v = \frac{(\nabla f)_p}{\|(\nabla f)_p\|}, \qquad \text{and minimal choosing } v = -\frac{(\nabla f)_p}{\|(\nabla f)_p\|}$$

(Intuition: the gradient points in the direction of greatest change.)

The Gradient Properties
• Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function around p. If f has a local minimum (maximum) at p, then $(\nabla f)_p = \vec{0}$ (necessary condition for a local min (max)).

The Gradient Properties

Intuition


The Gradient Properties
Formally: for any $v \in \mathbb{R}^n \setminus \{0\}$ we get

$$0 = \frac{d}{dt} f(p + t v)\Big|_{t=0} = \langle (\nabla f)_p,\, v \rangle \;\;\Rightarrow\;\; (\nabla f)_p = \vec{0}$$

The Gradient Properties
• We found the best INFINITESIMAL DIRECTION at each point
• We are looking for the minimum using a “blind man” procedure
• How can we get to the minimum using this knowledge?


Steepest Descent

• Algorithm
  Initialization: $x^0 \in \mathbb{R}^n$
  loop while $\|\nabla f(x^i)\| > \varepsilon$
    compute search direction $h^i = \nabla f(x^i)$
    update $x^{i+1} = x^i - \lambda\, h^i$
  end loop
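A minimal sketch of this loop in code (illustrative; the test function, step size, and tolerance are arbitrary choices):

```python
import numpy as np

def steepest_descent(grad_f, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """Minimize f by repeatedly stepping against its gradient (fixed step size lam)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad_f(x)                 # search direction: the gradient at x
        if np.linalg.norm(h) <= eps:  # stop when the gradient is (nearly) zero
            break
        x = x - lam * h               # update x^{i+1} = x^i - lambda * h^i
    return x

# Example: f(x) = x1^2 + 3*x2^2, gradient (2*x1, 6*x2); minimum at the origin
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])
print(steepest_descent(grad_f, [4.0, -2.0]))
```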

Setting the Learning Rate Intelligently

Minimizing this approximation:
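One standard way to make this concrete (an assumption about the missing detail, not a quote of the slide) is to expand f to second order around $x^i$ and minimize the approximation along the gradient direction, which yields an optimal step size:

$$f(x^i - \eta \nabla f) \;\approx\; f(x^i) - \eta\, \|\nabla f\|^2 + \tfrac{1}{2}\,\eta^2\, \nabla f^{\,t} H\, \nabla f
\qquad\Rightarrow\qquad
\eta_{\mathrm{opt}} = \frac{\|\nabla f\|^2}{\nabla f^{\,t} H\, \nabla f}$$

where H is the Hessian of f at $x^i$.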


Minimizing Perceptron Criterion

Perceptron Criterion

$$J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$$

$$\nabla_{\mathbf{a}} J_P(\mathbf{a}) = \nabla_{\mathbf{a}} \sum_{\mathbf{y}_i \in Y} -\mathbf{a}^t \mathbf{y}_i = \sum_{\mathbf{y}_i \in Y} -\mathbf{y}_i$$

Which gives the update rule:

$$\mathbf{a}(k+1) = \mathbf{a}(k) + \eta_k \sum_{\mathbf{y}_i \in Y_k} \mathbf{y}_i$$

where $Y_k$ is the misclassified set at step k
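A compact sketch of the resulting batch perceptron procedure (illustrative; assumes label-normalized, augmented samples as rows of Y and a fixed learning rate):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: a(k+1) = a(k) + eta * sum of the misclassified y_i."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]           # Y_k: samples with a^t y_i <= 0 (<= handles a = 0 start)
        if len(mis) == 0:             # no errors: a is a separating vector
            return a
        a = a + eta * mis.sum(axis=0)
    return a                          # may not separate if the data are not separable

Y = np.array([[ 1.0, 1.0, 2.0],
              [ 1.0, 2.0, 0.5],
              [-1.0, 1.0, 1.5],
              [-1.0, 2.0, 0.5]])
print(batch_perceptron(Y))
```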


Batch Perceptron Update


Simplest Case


When does the perceptron rule converge?


Different Error Functions: four learning criteria as a function of weights in a linear classifier.

At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures.

At the upper right is the Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient descent.

The lower left is squared error (Eq. 32), which has nice analytic properties and is useful even when the patterns are not linearly separable.

The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region in hopes of improving generalization of the resulting classifier.


Three Different Issues

1) Class boundary expressiveness
2) Error definition
3) Learning method


Fisher’s Linear Discriminant


Fisher’s Linear Discriminant: Intuition


Write this in terms of w

We are looking for a good projection

So,

Next,


Define the scatter matrix:

Then,

Thus the denominator can be written:


Rayleigh Quotient: maximizing it is equivalent to solving:

With solution:
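As a sketch of the standard closed-form result (synthetic data; the variable names are not from the slides): the Fisher direction maximizing the ratio of between-class to within-class scatter is w ∝ S_W⁻¹(m1 − m2).

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # class omega_1 samples (synthetic)
X2 = rng.normal([3, 2], 1.0, size=(50, 2))   # class omega_2 samples (synthetic)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix S_W = S_1 + S_2
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction: w proportional to S_W^{-1} (m1 - m2)
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

# Project the data onto w; the two classes should separate well along this line
print(X1 @ w, X2 @ w)
```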

