
1

Lecture 4

Linear machine

Linear discriminant functions

Generalized linear discriminant function

Fisher’s Linear discriminant

Perceptron

Optimal separating hyperplane


2

Linear discriminant functions

g(x) = w^t x + w0

The signed distance from x to the hyperplane g(x) = 0 is r = g(x) / ||w||; if w is a unit vector, g(x) itself is the signed distance. Decide the class by its sign.
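As a small illustration (not from the slides), the discriminant and the signed distance can be computed as below; the weight vector, bias, and sample point are made-up values:

```python
import numpy as np

# Hypothetical weights and bias for a 2-D feature space
w = np.array([3.0, 4.0])   # weight vector (not necessarily unit length)
w0 = -5.0                  # bias / threshold weight

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

x = np.array([2.0, 1.0])
r = g(x) / np.linalg.norm(w)    # signed distance from x to the hyperplane g(x) = 0
label = 1 if g(x) > 0 else 2    # decide the class by the sign of g(x)
print(g(x), r, label)           # 5.0, 1.0, class 1
```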


3

Linear discriminant functions

If x1 and x2 are both on the decision surface, then w^t x1 + w0 = w^t x2 + w0 = 0, so w^t (x1 - x2) = 0: w is normal to any vector lying in the decision surface.

From the discriminant function point of view: writing x = x_p + r w/||w||, where x_p is the projection of x onto the surface, gives g(x) = r ||w||, so r = g(x)/||w||.


4

Linear discriminant functions

More than two classes.

#classes=c

Dichotomize?

c linear discriminants

Pairwise?

c(c-1)/2 linear discriminants
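For instance, with c = 4 classes the dichotomizing approach needs 4 discriminants, while the pairwise approach needs 4·3/2 = 6.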


5

Linear discriminant functions

Remember what we did in the Bayes decision lecture?

Define c linear discriminant functions:

g_i(x) = w_i^t x + w_i0, i = 1, …, c

The overall classifier maximizes the discriminant at every x:

decide ω_i if g_i(x) ≥ g_j(x) for all j ≠ i

The resulting classifier is a linear machine. The space is divided into c regions.

The boundary between neighboring regions R_i and R_j is linear, because it is given by g_i(x) = g_j(x), i.e. (w_i - w_j)^t x + (w_i0 - w_j0) = 0.
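A minimal sketch of such a linear machine; the weight matrix (one row per class), biases, and sample are made-up values for illustration:

```python
import numpy as np

# Hypothetical linear machine with c = 3 classes in a 2-D feature space:
# row i of W holds w_i, and b[i] holds w_i0, so g_i(x) = w_i^t x + w_i0.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])

def linear_machine(x):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    g = W @ x + b
    return int(np.argmax(g)) + 1   # classes numbered 1..c

print(linear_machine(np.array([2.0, 1.0])))   # -> class 1 for this sample
```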


6

Linear discriminant functions


7

Generalized linear discriminant functions

Transform x into a (generally higher-dimensional) vector y of functions of x and use a discriminant that is linear in y: g(x) = a^t y. A linear separation in the transformed space can then correspond to a non-linear separation in the original feature space.


8

Generalized linear discriminant functions

In the two-class case, g(x) = g1(x) - g2(x).

Example:

a^t = (-3, 2, 5), with y = (1, x, x^2)^t

g(x) = a^t y = -3 + 2x + 5x^2

g(x) = 0 when x = 3/5 or x = -1

g(x) > 0 when x > 3/5 or x < -1: decide R1

g(x) < 0 when -1 < x < 3/5: decide R2
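The same example evaluated in code: mapping x to y = (1, x, x^2)^t makes the quadratic discriminant linear in y. The test points below are arbitrary:

```python
import numpy as np

a = np.array([-3.0, 2.0, 5.0])          # a^t = (-3, 2, 5)

def g(x):
    """g(x) = a^t y with the mapping y = (1, x, x^2)^t."""
    y = np.array([1.0, x, x**2])
    return a @ y

for x in (-2.0, -1.0, 0.0, 0.6, 1.0):
    region = "R1" if g(x) > 0 else "R2" if g(x) < 0 else "boundary"
    print(f"x = {x:4.1f}  g(x) = {g(x):6.2f}  {region}")
# Roots at x = -1 and x = 3/5; g > 0 outside that interval (decide R1), g < 0 inside (decide R2).
```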


9

Generalized linear discriminant functions


10

Generalized linear discriminant functions


11

Fisher Linear discriminant

The goal: project the data from d dimensions onto a line, y = w^t x, and find the line that maximizes the class separation after projection.

The magnitude of w is irrelevant; it just scales y.

The direction of w is what matters.

Projected mean of class i: the mean of the projected samples equals w^t m_i, where m_i is the class-i sample mean in the original space.


12

Fisher Linear discriminant

Then the distance between the projected means is |w^t m_1 - w^t m_2| = |w^t (m_1 - m_2)|.

Our goal is to make this distance large relative to a measure of the variation within each class.

Define the scatter of the projected class-i samples: s_i^2 = Σ_{y in class i} (y - w^t m_i)^2.

(1/n)(s_1^2 + s_2^2) is an estimate of the pooled variance of the projected data.

The Fisher linear discriminant maximizes J(w) = |w^t (m_1 - m_2)|^2 / (s_1^2 + s_2^2) over all w.


13

Fisher Linear discriminant

Let S_i = Σ_{x in class i} (x - m_i)(x - m_i)^t and S_W = S_1 + S_2.

Note, S_W is (up to a scale factor) the sample version of the pooled covariance matrix.

Then s_1^2 + s_2^2 = w^t S_W w.

Let S_B = (m_1 - m_2)(m_1 - m_2)^t.

Then |w^t (m_1 - m_2)|^2 = w^t S_B w, so J(w) = (w^t S_B w) / (w^t S_W w).

S_W: within-class scatter matrix

S_B: between-class scatter matrix


14

Fisher Linear discriminant

Setting the gradient of J(w) to zero gives the generalized eigenvalue problem S_B w = λ S_W w. Because for any w, S_B w is always in the direction of m_1 - m_2, the solution is simply w = S_W^{-1}(m_1 - m_2), up to an irrelevant scale factor.

Notice this is the same direction obtained from the Bayes decision rule when the two densities are normal with equal covariance matrices.
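A minimal sketch of the two-class Fisher direction, assuming S_W is invertible; the Gaussian data below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative Gaussian classes in 2-D
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W = S_1 + S_2, with S_i = sum (x - m_i)(x - m_i)^t
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2

# Fisher direction: w proportional to S_W^{-1} (m_1 - m_2)
w = np.linalg.solve(SW, m1 - m2)
w /= np.linalg.norm(w)            # only the direction matters

# Projected samples, projected means, and the criterion J(w)
y1, y2 = X1 @ w, X2 @ w
J = (y1.mean() - y2.mean())**2 / (y1.var(ddof=0) * len(y1) + y2.var(ddof=0) * len(y2))
print("w =", w, " J(w) =", J)
```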


15

Multiple discriminant analysis

Now there are c classes. The goal is to project onto a (c - 1)-dimensional space and maximize the between-group scatter relative to the within-group scatter.

Why c - 1? We need c - 1 discriminant functions.

Within-class scatter: S_W = Σ_{i=1..c} S_i, with S_i = Σ_{x in class i} (x - m_i)(x - m_i)^t.

Total mean: m = (1/n) Σ_x x = (1/n) Σ_{i=1..c} n_i m_i.


16

Multiple discriminant analysis

Total scatter: S_T = Σ_x (x - m)(x - m)^t = S_W + S_B.

Between-group scatter: S_B = Σ_{i=1..c} n_i (m_i - m)(m_i - m)^t.

Take a d × (c - 1) projection matrix W: y = W^t x. The scatter matrices of the projected samples become W^t S_W W and W^t S_B W.


17

Multiple discriminant analysis

The goal is to maximize the ratio of determinants:

J(W) = |W^t S_B W| / |W^t S_W W|

The solution: every column vector of W is among the first c - 1 generalized eigenvectors of S_B w_i = λ_i S_W w_i (those with the largest eigenvalues).

Since the projected scatter is not class-specific, this is more of a dimension-reduction procedure that captures as much class-discriminating information as possible.
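A sketch of this multi-class procedure, assuming S_W is nonsingular so the generalized symmetric eigensolver in scipy can be used; the three-class data is invented for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, n_components=None):
    """Project onto the leading generalized eigenvectors of S_B w = lambda S_W w."""
    classes = np.unique(labels)
    d = X.shape[1]
    m = X.mean(axis=0)                            # total mean
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)             # within-class scatter
        SB += len(Xc) * np.outer(mc - m, mc - m)  # between-class scatter
    if n_components is None:
        n_components = len(classes) - 1           # at most c - 1 useful directions
    # Generalized symmetric eigenproblem S_B w = lambda S_W w (assumes S_W nonsingular)
    eigvals, eigvecs = eigh(SB, SW)
    W = eigvecs[:, ::-1][:, :n_components]        # columns = leading eigenvectors
    return X @ W, eigvals[::-1][:n_components]

# Illustrative use with three made-up classes in 3-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 1.0, size=(40, 3)) for mu in ([0, 0, 0], [4, 0, 0], [0, 4, 0])])
labels = np.repeat([0, 1, 2], 40)
Y, vals = mda_projection(X, labels)
print(Y.shape, vals)   # (120, 2) projection and the two leading eigenvalues
```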


18

Multiple discriminant analysis


19

Multiple discriminant analysis

Eleven classes, projected onto the first two eigenvectors:


20

Multiple discriminant analysis

As the eigenvector rank increases, the separability decreases.


21

Multiple discriminant analysis


22

Separating hyperplane

Let’s do some data augmentation to make things easier.

If we have a decision boundary between two classes: g(x) = w^t x + w0 = 0.

Let y = (1, x^t)^t and a = (w0, w^t)^t.

Then g(x) = a^t y.

What’s the benefit? The hyperplane a^t y = 0 always goes through the origin of the augmented space.
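A tiny sketch of the augmentation; the sample and weights are made-up values:

```python
import numpy as np

def augment(x, w, w0):
    """Augmented sample y = (1, x^t)^t and weight a = (w0, w^t)^t, so g(x) = a^t y."""
    y = np.concatenate(([1.0], x))
    a = np.concatenate(([w0], w))
    return y, a

x = np.array([2.0, 1.0])                          # made-up sample
y, a = augment(x, w=np.array([3.0, 4.0]), w0=-5.0)
print(a @ y)   # equals w^t x + w0 = 5.0; the hyperplane a^t y = 0 passes through the origin
```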


23

Linearly separable case

Now we want to use the training samples to find the weight vector a which classifies all samples correctly.

If such an a exists, the samples are linearly separable.

a^t y_i > 0 for every y_i in class 1

a^t y_i < 0 for every y_i in class 2

If every y_i in class 2 is replaced by its negative, then we are looking for an a such that a^t y_i > 0 for every sample.

Such an a is a “separating vector” or “solution vector”.

a^t y_i = 0 is a hyperplane through the origin of weight space with y_i as its normal vector.

The solution vector lies on the positive side of every such hyperplane, i.e. in the intersection of n half-spaces.
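A sketch of this normalization, assuming the samples have already been augmented as above:

```python
import numpy as np

def normalize_samples(Y1, Y2):
    """Stack augmented samples, negating class 2, so a separating a satisfies a^t y_i > 0 for all i."""
    return np.vstack([Y1, -Y2])

def separates(a, Y):
    """True if a lies on the positive side of every hyperplane a^t y_i = 0."""
    return bool(np.all(Y @ a > 0))
```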


24

Linearly separable case

Every vector in the grey region is a solution vector; the region is called the “solution region”. A vector in the middle of the region intuitively looks better, and we can impose additional conditions to select such a vector.


25

Linearly separable case

Maximize the minimum distance from the samples to the plane


26

Gradient descent procedure

How to find a solution vector?

A general approach (a generic code sketch follows the steps):

Define a function J(a) which is minimized if a is a solution vector.

Start with an arbitrary vector a(0).

Find the gradient ∇J(a(k)).

Move from a(k) in the direction of the negative gradient to find a(k+1) = a(k) - η(k) ∇J(a(k)).

Iterate; stop when the change is smaller than a threshold.
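A generic sketch of such a descent loop; the criterion gradient grad_J, step size, and stopping threshold are placeholders to be supplied by the specific method (e.g. the perceptron criterion below):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, tol=1e-6, max_iter=1000):
    """Generic descent: step opposite the gradient until the update becomes small."""
    a = np.asarray(a0, dtype=float)
    for k in range(max_iter):
        step = eta * grad_J(a)
        a = a - step                       # move in the negative gradient direction
        if np.linalg.norm(step) < tol:     # stop when the change falls below a threshold
            break
    return a
```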


27

Gradient descent procedure


28

Perceptron

The perceptron criterion is J(a) = Σ_{y in Y(a)} (-a^t y), where Y(a) is the set of samples misclassified by a.

When Y(a) is empty, define J(a) = 0.

Because a^t y ≤ 0 when y is misclassified, every term -a^t y is non-negative, so J(a) is non-negative.

The gradient is simple: ∇J(a) = Σ_{y in Y(a)} (-y).

The update rule is: a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y.

η(k) is the learning rate.
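A minimal sketch of the batch perceptron rule on normalized, augmented samples; the 1-D training data is made up for illustration:

```python
import numpy as np

def perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron on normalized (class-2-negated) augmented samples Y, one row per sample."""
    a = np.zeros(Y.shape[1])
    for k in range(max_iter):
        mis = Y[Y @ a <= 0]                # Y(a): samples misclassified by the current a
        if len(mis) == 0:                  # separating vector found
            break
        a = a + eta * mis.sum(axis=0)      # a(k+1) = a(k) + eta * sum of misclassified samples
    return a

# Illustrative use with made-up 1-D data: class 1 at x > 0, class 2 at x < 0
Y1 = np.array([[1.0, 1.0], [1.0, 2.0]])       # augmented class-1 samples (1, x)
Y2 = np.array([[1.0, -1.0], [1.0, -2.0]])     # augmented class-2 samples
Y = np.vstack([Y1, -Y2])                      # negate class 2
a = perceptron(Y)
print(a, np.all(Y @ a > 0))                   # a separates the training samples
```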


29

Perceptron


30

Perceptron


31

Perceptron


32

Perceptron


33

Perceptron

The perceptron adjusts a only according to misclassified samples; correctly classified samples are ignored.

The final a is a linear combination of the training points.

To have good testing-sample performance, a large set of training samples is needed; however, it is almost certain that a large set of training samples is not linearly separable.

If the samples are not linearly separable, the iteration does not stop. We can force convergence by letting η(k) → 0 as k → ∞.

However, how should we choose the rate of decrease?


34

Optimal separating hyperplane

The perceptron finds one separating hyperplane out of infinitely many possibilities. How do we find the best among them?

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point.

•Unique solution

•Better test sample performance


35

Optimal separating hyperplane

Notation change!!!!

Here we use y_i ∈ {-1, +1} as the class label of sample i, and x_i as its feature vector.

min ||w||^2

s.t. y_i (w^t x_i + w0) ≥ 1, i = 1, …, N
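As a rough sketch (not the derivation of the next lecture), this small quadratic program can be handed to a general-purpose constrained solver; here scipy.optimize.minimize with made-up 2-D data and labels in {-1, +1}:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up 2-D data: rows of X are samples, y holds labels in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(p):
    w = p[:-1]
    return w @ w                                  # minimize ||w||^2

constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:-1] + p[-1]) - 1.0}
               for i in range(len(y))]            # y_i (w^t x_i + w0) - 1 >= 0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, w0 = res.x[:-1], res.x[-1]
print(w, w0, y * (X @ w + w0))                    # all margins should come out >= 1
```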

We shall visit the support vector machine next time.