
Linear discriminant functions

Andrea [email protected]

Machine Learning


Discriminative learning

Discriminative vs generative

Generative learning assumes knowledge of the distribution governing the data.
Discriminative learning focuses on directly modeling the discriminant function.
E.g. for classification, directly modeling decision boundaries (rather than inferring them from the modelled data distributions).


Discriminative learning

PROS
When data are complex, modeling their distribution can be very difficult.
If data discrimination is the goal, modeling the data distribution is not needed.
Focuses parameters (and thus the use of available training examples) on the desired goal.

CONS
The learned model is less flexible in its usage.
It does not allow performing arbitrary inference tasks.
E.g. it is not possible to efficiently generate new data from a certain class.


Linear discriminant functions

Description

f(x) = wᵀx + w0

The discriminant function is a linear combination of example features.
w0 is called bias or threshold.
It is the simplest possible discriminant function.
Depending on the complexity of the task and the amount of data, it can be the best option available (at least it is the first one to try).


Linear binary classifier

Description

f(x) = sign(wᵀx + w0)

It is obtained by taking the sign of the linear function.
The decision boundary (f(x) = 0) is a hyperplane H.
The weight vector w is orthogonal to the decision hyperplane:

∀x, x′ : f(x) = f(x′) = 0
⇒ wᵀx + w0 − wᵀx′ − w0 = 0
⇒ wᵀ(x − x′) = 0
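As a concrete sketch (with an arbitrarily chosen w and w0, not from the slides), the classifier and the orthogonality property can be checked numerically:

```python
import numpy as np

# Hypothetical weight vector and bias, chosen for illustration only.
w = np.array([2.0, -1.0])
w0 = 1.0

def f(x):
    """Linear discriminant function f(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Linear binary classifier: the sign of the discriminant."""
    return np.sign(f(x))

# Two points on the decision hyperplane (f(x) = 0); their difference lies
# in the hyperplane, so w^T (x - x2) = 0, i.e. w is orthogonal to it.
x  = np.array([0.0, 1.0])    # f(x)  = 2*0 - 1*1 + 1 = 0
x2 = np.array([1.0, 3.0])    # f(x2) = 2*1 - 1*3 + 1 = 0
print(np.isclose(w @ (x - x2), 0.0))  # True
```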


Linear binary classifier

Functional margin

The value f(x) of the function for a certain point x is called the functional margin.
It can be seen as a confidence in the prediction.

Geometric margin

The distance from x to the hyperplane is called the geometric margin:

r^x = f(x) / ||w||

It is a normalized version of the functional margin.
The distance from the origin to the hyperplane is:

r^0 = f(0) / ||w|| = w0 / ||w||
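A minimal numeric check of both margins, using hypothetical weights with ||w|| = 5:

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weights; ||w|| = 5
w0 = -5.0

def functional_margin(x):
    """f(x) = w^T x + w0: the confidence of the prediction."""
    return w @ x + w0

def geometric_margin(x):
    """Normalized functional margin: signed distance from x to the hyperplane."""
    return functional_margin(x) / np.linalg.norm(w)

x = np.array([3.0, 4.0])
print(functional_margin(x))         # 3*3 + 4*4 - 5 = 20.0
print(geometric_margin(x))          # 20 / 5 = 4.0
# Signed distance from the origin to the hyperplane: r^0 = w0 / ||w||
print(w0 / np.linalg.norm(w))       # -1.0
```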


Linear binary classifier

Geometric margin (cont)

A point can be expressed as its projection on H plus its distance to H times the unit vector in that direction:

x = xp + r^x · w / ||w||


Linear binary classifier

Geometric margin (cont)

Then, as f(xp) = 0:

f(x) = wᵀx + w0
     = wᵀ(xp + r^x · w/||w||) + w0
     = wᵀxp + w0 + r^x · wᵀw/||w||      (wᵀxp + w0 = f(xp) = 0)
     = r^x · ||w||

so that f(x)/||w|| = r^x


Biological motivation

Human Brain

Composed of a densely interconnected network of neurons.
A neuron is made of:

soma: a central body containing the nucleus
dendrites: a set of filaments departing from the body
axon: a longer filament (up to 100 times the body diameter)
synapses: connections between dendrites and axons from other neurons


Biological motivation

Human Brain

Electrochemical reactions allow signals to propagate along neurons via axons, synapses and dendrites.
Synapses can either excite or inhibit a neuron's potential.
Once a neuron's potential exceeds a certain threshold, a signal is generated and transmitted along the axon.


Perceptron

Single neuron architecture

f(x) = sign(wᵀx + w0)

Linear combination of input features.
Threshold activation function.


Perceptron

Representational power

Linearly separable sets of examples.
E.g. primitive boolean functions (AND, OR, NAND, NOT)
⇒ any logic formula can be represented by a network of two levels of perceptrons (in disjunctive or conjunctive normal form).

Problem

Non-linearly separable sets of examples cannot be separated.
Representing complex logic formulas can require a number of perceptrons exponential in the size of the input.
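The primitive boolean functions can be realized with hand-picked weights (one of many valid choices; outputs are encoded as {0, 1} rather than signs for readability):

```python
import numpy as np

def perceptron(w, w0):
    """A threshold unit: 1 if w^T x + w0 > 0, else 0."""
    return lambda x: 1 if np.dot(w, x) + w0 > 0 else 0

# Hand-picked weights for the primitive boolean functions.
AND  = perceptron(np.array([1.0, 1.0]), -1.5)
OR   = perceptron(np.array([1.0, 1.0]), -0.5)
NAND = perceptron(np.array([-1.0, -1.0]), 1.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([AND(np.array(x)) for x in inputs])   # [0, 0, 0, 1]
print([OR(np.array(x)) for x in inputs])    # [0, 1, 1, 1]
print([NAND(np.array(x)) for x in inputs])  # [1, 1, 1, 0]

# XOR is not linearly separable, so no single perceptron computes it,
# but a two-level network (AND of OR and NAND) does.
XOR = lambda x: AND(np.array([OR(x), NAND(x)]))
print([XOR(np.array(x)) for x in inputs])   # [0, 1, 1, 0]
```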


Parameter learning

Error minimization

Need to find a function of the parameters to be optimized (as in maximum likelihood for probability distributions).
A reasonable choice is a measure of the error on the training set D (i.e. the loss ℓ):

E(w; D) = Σ_{(x,y)∈D} ℓ(y, f(x))

Problem of overfitting the training data (less severe for linear classifiers; we will discuss it).


Parameter learning

Gradient descent

1. Initialize w (e.g. w = 0)
2. Iterate until the gradient is approximately zero:

   w = w − η∇E(w; D)

Note

η is called the learning rate and controls the amount of movement at each gradient step.
The algorithm is guaranteed to converge to a local optimum of E(w; D) (for small enough η).
Too low an η implies slow convergence.
Techniques exist to adaptively modify η.
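A generic sketch of the procedure, minimizing a toy quadratic error whose gradient is known in closed form (the error function, learning rate and stopping rule are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, w_init, eta=0.1, tol=1e-6, max_iter=10000):
    """Iterate w <- w - eta * grad(w) until the gradient is approx. zero."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # gradient approx. zero: stop
            break
        w = w - eta * g
    return w

# Toy error E(w) = ||w - [1, 2]||^2, with gradient 2(w - [1, 2]);
# its unique (hence global) optimum is w* = [1, 2].
grad = lambda w: 2 * (w - np.array([1.0, 2.0]))
print(gradient_descent(grad, np.zeros(2)))  # approx. [1. 2.]
```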


Parameter learning

Problem

The misclassification loss is piecewise constant, and thus a poor candidate for gradient descent.

Perceptron training rule

E(w; D) = Σ_{(x,y)∈D_E} −y f(x)

where D_E is the set of current training errors, i.e. the examples for which:

y f(x) ≤ 0

The error is the sum of the functional margins of incorrectly classified examples (negated, so that each term is non-negative).


Parameter learning

Perceptron training rule

We focus on an unbiased (i.e. w0 = 0) perceptron for simplicity.

∇E(w; D) = ∇ Σ_{(x,y)∈D_E} −y f(x)
         = ∇ Σ_{(x,y)∈D_E} −y (wᵀx)
         = Σ_{(x,y)∈D_E} −y x

The amount of update is:

−η∇E(w; D) = η Σ_{(x,y)∈D_E} y x


Perceptron learning

Stochastic perceptron training rule

1. Initialize weights randomly
2. Iterate until all examples are correctly classified:
   For each misclassified training example (x, y), update each weight:

   wi ← wi + η y xi

Note on stochasticity

We make a gradient step for each training error (rather than on their sum, as in batch learning).
Each gradient step is very fast.
Stochasticity can sometimes help to avoid local minima, being guided by a different gradient for each training example (which won't have the same local minima in general).
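A sketch of the rule on a small, hypothetical linearly separable dataset (unbiased perceptron, labels in {−1, +1}):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Stochastic perceptron rule (unbiased, w0 = 0): on each training
    error, take the gradient step w <- w + eta * y * x."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, y):
            if t * (w @ x) <= 0:      # misclassified (or on the boundary)
                w += eta * t * x      # gradient step for this example
                errors += 1
        if errors == 0:               # all examples correctly classified
            break
    return w

# Toy linearly separable data, invented for illustration.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.all(np.sign(X @ w) == y))  # True
```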

[Pages 18–22: figure-only slides illustrating perceptron learning]

Perceptron regression

Exact solution

Let X ∈ IR^{n×d} be the input training matrix (i.e. X = [x1 · · · xn]ᵀ for n = |D| and d = |x|).
Let y ∈ IRⁿ be the output training vector (i.e. yi is the output for example xi).
Regression learning could be stated as a set of linear equations:

Xw = y

giving as solution:

w = X⁻¹y


Perceptron regression

Problem

Matrix X is rectangular, usually with more rows than columns.
The system of equations is overdetermined (more equations than unknowns).
No exact solution typically exists.


Perceptron regression

Mean squared error (MSE)

Resort to loss minimization.
The standard loss for regression is the mean squared error:

E(w; D) = Σ_{(x,y)∈D} (y − f(x))² = (y − Xw)ᵀ(y − Xw)

A closed-form solution exists.
It can always be solved by gradient descent (which can be faster).
It can also be used as a classification loss.


Perceptron regression

Closed form solution

∇E(w; D) = ∇ Σ_{(x,y)∈D} (y − wᵀx)²
         = Σ_{(x,y)∈D} 2(y − wᵀx)(−x)
         = −2Xᵀ(y − Xw) = 0

Xᵀy = XᵀXw
w = (XᵀX)⁻¹Xᵀy


Perceptron regression

w = (XᵀX)⁻¹Xᵀy

Note

(XᵀX)⁻¹Xᵀ is called the pseudoinverse.
If X is square and nonsingular, inverse and pseudoinverse coincide, and the MSE solution corresponds to the exact one.
The pseudoinverse always exists → an approximate solution can always be found.
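A quick numerical check that the normal-equations solution matches NumPy's Moore-Penrose pseudoinverse, on a hypothetical overdetermined system:

```python
import numpy as np

# Overdetermined toy problem: 4 equations, 2 unknowns (rectangular X).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.1, 3.9])

# MSE solution via the normal equations: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.pinv computes the Moore-Penrose pseudoinverse, which exists
# even when X^T X is singular; with full column rank the two coincide.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w, w_pinv))  # True
```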


Perceptron regression

Gradient descent

∂E/∂wi = ∂/∂wi (1/2) Σ_{(x,y)∈D} (y − f(x))²
       = (1/2) Σ_{(x,y)∈D} ∂/∂wi (y − f(x))²
       = (1/2) Σ_{(x,y)∈D} 2(y − f(x)) · ∂/∂wi (y − wᵀx)
       = Σ_{(x,y)∈D} (y − f(x))(−xi)
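The component-wise gradient above stacks into the vector −Xᵀ(y − Xw); a sketch on a toy, exactly linear dataset (the learning rate and epoch count are illustrative choices):

```python
import numpy as np

def mse_gradient_descent(X, y, eta=0.05, epochs=2000):
    """Gradient descent on E(w) = 1/2 * sum (y - w^T x)^2, using the
    per-component gradient dE/dw_i = sum (y - f(x)) * (-x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residuals = y - X @ w     # (y - f(x)) for every example
        grad = -X.T @ residuals   # stacks sum (y - f(x)) * (-x_i) over i
        w -= eta * grad
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column: bias feature
y = np.array([2.0, 3.0, 4.0])                        # exactly y = 1 + x
w = mse_gradient_descent(X, y)
print(np.allclose(w, [1.0, 1.0], atol=1e-3))  # True
```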


Multiclass classification

One-vs-all

Learn one binary classifier for each class:
  positive examples are examples of the class
  negative examples are examples of all other classes
Predict a new example in the class with maximum functional margin.
Decision boundaries for which fi(x) = fj(x) are pieces of hyperplanes:

wiᵀx = wjᵀx
(wi − wj)ᵀx = 0
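A sketch of one-vs-all prediction with hypothetical per-class weight vectors:

```python
import numpy as np

# Hypothetical weight vectors, one per class (rows of W), with biases w0.
W  = np.array([[ 1.0,  0.0],   # class 0
               [-1.0,  0.0],   # class 1
               [ 0.0,  1.0]])  # class 2
w0 = np.array([0.0, 0.0, 0.0])

def predict_ova(x):
    """One-vs-all: pick the class whose classifier has maximum functional margin."""
    margins = W @ x + w0          # f_i(x) for every class i
    return int(np.argmax(margins))

print(predict_ova(np.array([2.0, 0.5])))   # 0
print(predict_ova(np.array([-3.0, 1.0])))  # 1
print(predict_ova(np.array([0.5, 2.0])))   # 2
```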


Multiclass classification

all-pairs

Learn one binary classifier for each pair of classes:
  positive examples from one class
  negative examples from the other
Predict a new example in the class winning the largest number of pairwise classifications.
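A sketch of all-pairs voting; the pairwise classifiers here are given as fixed, hypothetical parameters rather than trained:

```python
import numpy as np

# Hypothetical pairwise classifiers f_ij(x) = w^T x + w0, trained on
# class i (positive) vs class j (negative).
pairwise = {
    (0, 1): (np.array([ 1.0,  0.0]), 0.0),
    (0, 2): (np.array([ 1.0, -1.0]), 0.0),
    (1, 2): (np.array([-1.0, -1.0]), 0.0),
}

def predict_all_pairs(x, n_classes=3):
    """Predict the class winning the largest number of pairwise votes."""
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), (w, w0) in pairwise.items():
        winner = i if w @ x + w0 > 0 else j
        votes[winner] += 1
    return int(np.argmax(votes))

print(predict_all_pairs(np.array([2.0, 1.0])))  # 0
```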


Generative linear classifiers

Gaussian distributions

Linear decision boundaries are obtained when the covariance is shared among classes (Σi = Σ).

Naive Bayes classifier

fi(x) = P(x|yi) P(yi) = [ ∏_{j=1..|x|} ∏_{k=1..K} θ_{k,yi}^{zk(x[j])} ] · |Di|/|D|
      = [ ∏_{k=1..K} θ_{k,yi}^{Nkx} ] · |Di|/|D|

where Nkx is the number of times feature k (e.g. a word) appears in x.


Generative linear classifiers

Naive Bayes classifier (cont)

log fi(x) = Σ_{k=1..K} Nkx log θ_{k,yi} + log(|Di|/|D|) = wᵀx′ + w0

with:

x′ = [N1x · · · NKx]ᵀ
w = [log θ_{1,yi} · · · log θ_{K,yi}]ᵀ
w0 = log(|Di|/|D|)

Naive Bayes is a log-linear model (as are Gaussian distributions with shared Σ).
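A numeric check that the product form and the log-linear form coincide, with hypothetical parameters θ and prior:

```python
import numpy as np

# Toy naive Bayes score for one class y_i, in log-linear form.
# Hypothetical parameters: theta[k] = theta_{k,yi}, prior = |D_i| / |D|.
theta = np.array([0.5, 0.3, 0.2])
prior = 0.4

# Feature counts N_kx: how often each of the K features appears in x.
x_counts = np.array([2.0, 1.0, 0.0])

# Product form: f_i(x) = prod_k theta_k^{N_kx} * prior
f_direct = np.prod(theta ** x_counts) * prior

# Log-linear form: log f_i(x) = w^T x' + w0, w = log(theta), w0 = log(prior)
w, w0 = np.log(theta), np.log(prior)
f_linear = np.exp(w @ x_counts + w0)

print(np.isclose(f_direct, f_linear))  # True
```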
