Probability and Learning - Machine Learning, Unit 4
University of Vienna, 28 March 2014


Probability and Learning

The Naive Bayes Classifier

Logistic Regression

Linear Discriminant Analysis (LDA)


Classification problem

The goal in classification is to take an input vector x and to assign it to one of k discrete classes C_i, where i = 1, ..., k. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. In probabilistic models we define t_i as the probability that the class is C_i:

\sum_{i=1}^{k} t_i = 1

Three strategies for classification

1 generative models

2 discriminative models

3 discriminant function


Generative models

Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

1 Solve the inference problem of determining the class-conditional densities p(x|C_i) for each class C_i individually.

2 Separately infer the prior class probabilities p(C_i).

3 Use Bayes' theorem in the form

p(C_i | x) = \frac{p(x | C_i)\, p(C_i)}{p(x)}

to find the posterior class probabilities p(C_i|x). The denominator is

p(x) = \sum_i p(x | C_i)\, p(C_i)

Equivalently, we can model the joint distribution p(x, C_i) directly and then normalize to obtain the posterior probabilities.


Discriminative models

Approaches that model the posterior probabilities directly are called discriminative models.

1 Solve the inference problem of determining the posterior class probabilities p(C_i|x).

2 Subsequently use decision theory to assign each new x to one of the classes.


Discriminant function

Find a function f(x), called a discriminant function, which maps each input x directly onto a class label. In the case of two-class problems, f might be binary valued, such that f = 0 represents class C_1 and f = 1 represents class C_2. In this case, probabilities play no role. Examples: Fisher's discriminant, the perceptron algorithm.


Example

Options: go to the pub, watch TV, go to a party, or study. The choice depends on whether an assignment is due, whether a party is happening, and whether you are feeling lazy.

Deadline  Party  Lazy  Activity
Urgent    Yes    Yes   Party
Urgent    No     Yes   Study
Near      Yes    Yes   Party
None      Yes    No    Party
None      No     Yes   Pub
None      Yes    No    Party
Near      No     No    Study
Near      No     Yes   TV
Near      Yes    Yes   Party
Urgent    No     No    Study


Bayes' Theorem

There are m = 4 different classes C_i and n = 10 different examples X_j.

C1 = Pub, C2 = TV, C3 = Party, C4 = Study

For Deadline we have 3 states: D1 = Urgent, D2 = Near, D3 = None
For Party we have 2 states: P1 = Yes, P2 = No
For Lazy we have 2 states: L1 = Yes, L2 = No

We estimate P(C_i) as the number of examples whose class is C_i, divided by the total number of examples:

P(Pub) = 0.1, P(TV) = 0.1, P(Party) = 0.5, P(Study) = 0.3
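For example, the priors can be read off by simple counting (a minimal Python sketch; the activities list just restates the last column of the table):

    from collections import Counter

    # The Activity column of the table above, one entry per training example
    activities = ["Party", "Study", "Party", "Party", "Pub",
                  "Party", "Study", "TV", "Party", "Study"]

    counts = Counter(activities)
    priors = {c: n / len(activities) for c, n in counts.items()}
    print(priors)   # {'Party': 0.5, 'Study': 0.3, 'Pub': 0.1, 'TV': 0.1}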


Bayes' Theorem

The conditional probability of C_i given that x has value X is P(C_i|X); it tells us how likely the class is C_i given that the value of x is X. The question is how to obtain this conditional probability, since we cannot read it directly from the table. What we can read directly from the table are the probabilities P(D_j|C_i):

P(Urgent|Pub) = 0, P(Near|Pub) = 0, P(None|Pub) = 1
P(Urgent|TV) = 0, P(Near|TV) = 1, P(None|TV) = 0
P(Urgent|Party) = 0.2, P(Near|Party) = 0.4, P(None|Party) = 0.4
P(Urgent|Study) = 2/3, P(Near|Study) = 1/3, P(None|Study) = 0


Bayes' Theorem

We use the same procedure for the second feature, Party:

P(Party|Pub) = 0, P(No Party|Pub) = 1
P(Party|TV) = 0, P(No Party|TV) = 1
P(Party|Party) = 1, P(No Party|Party) = 0
P(Party|Study) = 0, P(No Party|Study) = 1

and for the third feature, Lazy:

P(Lazy|Pub) = 1, P(No Lazy|Pub) = 0
P(Lazy|TV) = 1, P(No Lazy|TV) = 0
P(Lazy|Party) = 0.6, P(No Lazy|Party) = 0.4
P(Lazy|Study) = 1/3, P(No Lazy|Study) = 2/3


The Naive Bayes classifier

The simplification: the features are assumed to be conditionally independent of each other, given the class.

The Naive Bayes classifier:

feed in the values of the features

compute the probability of each of the possible classes

pick the most likely class

Suppose: Deadline = Near, Party = No, Lazy = Yes


The Naive Bayes classifier

From conditional independence, P(B, C | A) = P(B|A) P(C|A), it follows that

P(A | B, C, D) = \frac{P(A)\, P(B, C, D | A)}{P(B, C, D)} = \frac{P(A)\, P(B|A)\, P(C|A)\, P(D|A)}{P(B, C, D)}

It is sufficient to calculate the numerator, because the denominator is the same for every activity.

P(Pub) P(Near|Pub) P(No Party|Pub) P(Lazy|Pub) = 0.1 × 0 × 1 × 1 = 0
P(TV) P(Near|TV) P(No Party|TV) P(Lazy|TV) = 0.1 × 1 × 1 × 1 = 0.1
P(Party) P(Near|Party) P(No Party|Party) P(Lazy|Party) = 0.5 × 0.4 × 0 × 0.6 = 0
P(Study) P(Near|Study) P(No Party|Study) P(Lazy|Study) = 0.3 × (1/3) × 1 × (1/3) ≈ 0.033

So based on this you will be watching TV tonight. To scale these to probabilities, normalize the two non-zero values:

P(TV) = \frac{0.1}{0.1 + 1/30} = 0.75,  P(Study) = \frac{1/30}{0.1 + 1/30} = 0.25
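The whole calculation fits in a short Python sketch (the prior and likelihood values simply restate the numbers derived above; the variable names are illustrative):

    # Naive Bayes for the activity example: priors P(C) and per-feature
    # likelihoods P(value | C) are the counts read off the table above.
    priors = {"Pub": 0.1, "TV": 0.1, "Party": 0.5, "Study": 0.3}

    likelihoods = {
        "Pub":   {"Near": 0.0, "No Party": 1.0, "Lazy": 1.0},
        "TV":    {"Near": 1.0, "No Party": 1.0, "Lazy": 1.0},
        "Party": {"Near": 0.4, "No Party": 0.0, "Lazy": 0.6},
        "Study": {"Near": 1/3, "No Party": 1.0, "Lazy": 1/3},
    }

    observation = ["Near", "No Party", "Lazy"]   # Deadline=Near, Party=No, Lazy=Yes

    # Unnormalised score for each class: P(C) * product over features of P(x_k | C)
    scores = {}
    for c in priors:
        score = priors[c]
        for value in observation:
            score *= likelihoods[c][value]
        scores[c] = score

    total = sum(scores.values())
    posteriors = {c: s / total for c, s in scores.items()}
    print(posteriors)   # Pub: 0, TV: 0.75, Party: 0, Study: 0.25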


Probabilistic Discriminative Models

The second approach is to use the functional form of the generalized linear model explicitly and to determine its parameters directly by maximum likelihood. In this direct approach we maximize a likelihood function defined through the conditional distribution p(C_i|x), which represents a form of discriminative training. One advantage of the discriminative approach is that there will typically be fewer adaptive parameters to determine. It may also lead to improved predictive performance, particularly when the class-conditional density assumptions give a poor approximation to the true distributions.


Logistic Regression

Classification

Email: Spam/Not Spam?

Online Transaction: Fraudulent (Yes/No)?

Tumor: Malignant/Benign?

Two-class classification problem: y ∈ {0, 1}
0: negative class (e.g. benign tumor), the absence of something
1: positive class (e.g. malignant tumor), the presence of something

Multiclass classification problem: y ∈ {0, 1, 2, 3}


Logistic Regression

The posterior probability for class C_1 can be written as

p(C_1 | x) = \frac{p(x | C_1)\, p(C_1)}{p(x | C_1)\, p(C_1) + p(x | C_2)\, p(C_2)}

which can be rewritten as

p(C_1 | x) = \frac{1}{1 + \exp(-z)} = g(z)

with

z = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)}


Logistic Sigmoid Function

g(z) = \frac{1}{1 + e^{-z}}

g(z) is called the sigmoid function or logistic function. The term sigmoid means S-shaped.
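As a quick plain-NumPy illustration (not part of the original slides), the sigmoid can be written as:

    import numpy as np

    def sigmoid(z):
        # Logistic sigmoid g(z) = 1 / (1 + exp(-z)); works element-wise on arrays
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]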


Logit function

The inverse of the logistic sigmoid is given by

z = \ln \frac{g}{1 - g}

and is known as the logit function. It represents the log of the ratio of probabilities for the two classes, also known as the log odds.


Logistic regression

Threshold the linear classifier output h_θ(x) = θ^T x at 0.5:
If h_θ(x) ≥ 0.5, predict y = 1
If h_θ(x) < 0.5, predict y = 0

Using ordinary regression for a classification problem is not a very good idea:

1 θ can be strongly influenced by outliers

2 h_θ(x) can be > 1 or < 0

Logistic regression: we want 0 ≤ h_θ(x) ≤ 1.
Despite its name, logistic regression solves a classification problem, not a regression problem.
Here is the solution: h_θ(x) = g(θ^T x)


Hypothesis Representation

Task: fit the parameters θ to the data.

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

is interpreted as the estimated probability that y = 1 on input x:

h_θ(x) = P(y = 1 | x; θ)

Suppose we

predict y = 1 if h_θ(x) ≥ 0.5

predict y = 0 if h_θ(x) < 0.5

Then h_θ(x) = g(θ^T x) ≥ 0.5 exactly when θ^T x ≥ 0, and h_θ(x) = g(θ^T x) < 0.5 exactly when θ^T x < 0.


Decision Boundary

Predict y = 1 when θ^T x ≥ 0
Predict y = 0 when θ^T x < 0

E.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2).
Then θ_0 + θ_1 x_1 + θ_2 x_2 = 0 is the decision boundary.


Non-linear decision boundary

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2)

Then θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2 = 0 is the decision boundary. E.g. if θ = (−1, 0, 0, 1, 1)^T, the decision boundary is the circle x_1^2 + x_2^2 = 1.

More complicated cases:

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_1^2 x_2 + θ_5 x_1^2 x_2^2 + θ_6 x_1^3 x_2 + ···)
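A small sketch of the circular boundary above, using the θ = (−1, 0, 0, 1, 1)^T example (the sigmoid helper is the one defined earlier; the feature-building code is illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta = (-1, 0, 0, 1, 1)^T

    def predict(x1, x2):
        # Feature vector (1, x1, x2, x1^2, x2^2); the boundary is x1^2 + x2^2 = 1
        features = np.array([1.0, x1, x2, x1**2, x2**2])
        return 1 if sigmoid(features @ theta) >= 0.5 else 0

    print(predict(0.5, 0.5))   # inside the unit circle  -> predicts 0
    print(predict(1.5, 0.0))   # outside the unit circle -> predicts 1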


Logistic regression cost function

Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x)) if y = 0

Note: y is always 0 or 1. The two cases combine into a single cost function:

Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right]


Cost function and gradient

The gradient of the cost function is a vector of the same length as θ whose j-th element is defined as follows:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

To find θ we use a gradient-based method, scipy.optimize.fmin_bfgs. It will find the best parameters θ for the logistic regression cost function, given a fixed dataset (of x and y values). The arguments of scipy.optimize.fmin_bfgs are:

the initial values of the parameters you are trying to optimize;

a function that, given the training set and a particular θ, computes the logistic regression cost and gradient with respect to θ for the dataset (x, y).
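A minimal sketch of these pieces is below; the tiny synthetic dataset and the way the intercept column is added are assumptions for illustration, not part of the slides:

    import numpy as np
    from scipy.optimize import fmin_bfgs

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # J(theta) = -1/m * sum( y*log(h) + (1 - y)*log(1 - h) )
        m = y.size
        h = sigmoid(X @ theta)
        return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

    def gradient(theta, X, y):
        # dJ/dtheta_j = 1/m * sum( (h - y) * x_j )
        m = y.size
        h = sigmoid(X @ theta)
        return X.T @ (h - y) / m

    # Tiny synthetic two-class dataset: one feature plus an intercept column
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = (x + rng.normal(scale=1.0, size=100) > 0).astype(float)
    X = np.column_stack([np.ones_like(x), x])

    theta0 = np.zeros(X.shape[1])                       # initial parameter values
    theta_opt = fmin_bfgs(cost, theta0, fprime=gradient, args=(X, y), disp=False)
    print(theta_opt)                                    # fitted (intercept, slope)

Here the cost and the gradient are passed as two separate callables (f and fprime), which is one way of supplying fmin_bfgs with the quantities described above.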


Visualizing the Data

The first two components of the first two groups of the iris data, plotted with different markers.


Results of logistic regression for iris data

The first two components of the first two groups of the iris data, plotted with different markers, together with the fitted decision boundary

-5.672 + 7.726 x_1 - 11.645 x_2 = 0

Accuracy: 99.00%


Multiclass classification

Email foldering/tagging: Work (y = 1), Friends (y = 2), Family (y = 3), Hobby (y = 4)

Medical diagnosis: Not ill (y = 1), Cold (y = 2), Flu (y = 3)

Weather: Sunny (y = 1), Cloudy (y = 2), Rain (y = 3), Snow (y = 4)


One-vs-all

Class 1: h_θ^(1)(x)
Class 2: h_θ^(2)(x)
Class 3: h_θ^(3)(x)

h_θ^(i)(x) = P(y = i | x; θ) for i = 1, 2, 3

h_θ^(i)(x) is the estimated probability that y = i on input x.

1 − h_θ^(i)(x) is the estimated probability that y ≠ i on input x (one-vs-rest).


One-vs-all

Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximizes

\max_i h_\theta^{(i)}(x)
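A minimal prediction sketch, assuming thetas is a list of fitted parameter vectors (one per class) and the same design-matrix convention as before:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_one_vs_all(thetas, X):
        # probs[i, j] = estimated P(y = class i | x_j) from the i-th classifier
        probs = np.array([sigmoid(X @ theta) for theta in thetas])
        # For each example, pick the class whose classifier gives the largest probability
        return probs.argmax(axis=0)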


Discriminant Function

The third strategy for classification is a discriminant function. This is a non-probabilistic method.
A discriminant is a function that takes an input vector x and assigns it to one of m classes, denoted C_i.
Linear discriminants have hyperplanes as their decision surfaces. If m = 2, a linear discriminant function is

y(x) = w^T x + w_0

where w is a weight vector and w_0 is a bias. The negative of the bias is sometimes called a threshold.
Decision rule: x is assigned to class C_1 if y(x) > 0 and to class C_2 otherwise.
The corresponding decision boundary is defined by y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.
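As a tiny illustration of this decision rule (the weight vector and bias below are made-up values):

    import numpy as np

    w = np.array([1.0, -2.0])    # illustrative weight vector
    w0 = 0.5                     # illustrative bias

    def classify(x):
        # Assign x to C1 if y(x) = w^T x + w0 > 0, otherwise to C2
        return "C1" if w @ x + w0 > 0 else "C2"

    print(classify(np.array([2.0, 0.0])))   # y(x) =  2.5 > 0  -> C1
    print(classify(np.array([0.0, 1.0])))   # y(x) = -1.5 <= 0 -> C2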


Linear Discriminant Analysis (LDA)

Consider c classes of data with means μ_1, μ_2, ..., μ_c and mean of the entire dataset μ.

Covariance:

C = \sum_j (x_j - \mu)(x_j - \mu)^T

Within-class scatter:

S_W = \sum_{\text{classes } c} \sum_{j \in c} p_c (x_j - \mu_c)(x_j - \mu_c)^T

with p_c the probability of the class (that is, the number of data points in that class divided by the total number).

Between-class scatter:

S_B = \sum_{\text{classes } c} \sum_{j \in c} (\mu_c - \mu)(\mu_c - \mu)^T

C = S_W + S_B


Fisher’s linear discriminant


Linear Discriminant Analysis (LDA)

The datasets are easy to separate into different classes (i.e. the classes are discriminable) if S_B is large relative to S_W.

The projection of the data:

z = w^T x

We want to choose w so that the ratio of between-class to within-class scatter,

\frac{w^T S_B w}{w^T S_W w},

is maximal. The solutions w are the generalised eigenvectors of S_W^{-1} S_B.
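A NumPy sketch of this computation, following the scatter-matrix definitions on the previous slide (the function name and the use of numpy.linalg.eig on S_W^{-1} S_B are illustrative choices):

    import numpy as np

    def lda_directions(X, labels):
        # X: (n_samples, n_features) data matrix; labels: class label for each row.
        # Returns projection directions as eigenvectors of S_W^{-1} S_B,
        # ordered by decreasing eigenvalue.
        labels = np.asarray(labels)
        n_samples, n_features = X.shape
        mu = X.mean(axis=0)
        S_W = np.zeros((n_features, n_features))
        S_B = np.zeros((n_features, n_features))

        for c in np.unique(labels):
            Xc = X[labels == c]
            mu_c = Xc.mean(axis=0)
            p_c = Xc.shape[0] / n_samples          # class probability
            centred = Xc - mu_c
            S_W += p_c * centred.T @ centred       # within-class scatter
            diff = (mu_c - mu)[:, None]
            S_B += Xc.shape[0] * diff @ diff.T     # between-class scatter, summed over j in c

        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
        order = np.argsort(eigvals.real)[::-1]
        return eigvecs[:, order].real

    # z = X @ lda_directions(X, labels)[:, :1] projects the data onto the
    # leading discriminant direction.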


Linear Discriminant Analysis (LDA)

Plot of the first two dimensions of the iris data showing the three classes before and after LDA has been applied. Only one dimension (y) is required to separate the classes after LDA has been applied.


Multiple classes

Extension of linear discriminants to m > 2 classes.

1 One-versus-the-rest classifier:
use m − 1 classifiers, each of which solves a two-class problem of separating points in a particular class C_i from points not in that class.

2 One-versus-one classifier:
use m(m − 1)/2 binary discriminant functions, one for every possible pair of classes. Each point is then classified according to a majority vote amongst the discriminant functions.

3 Single m-class discriminant:
use m linear functions of the form

y_i(x) = w_i^T x + w_{i0}

and assign a point x to class C_i if y_i(x) > y_j(x) for all j ≠ i.
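A brief sketch of the single m-class discriminant rule, assuming an illustrative weight matrix W (one row w_i per class) and bias vector w0:

    import numpy as np

    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])     # one weight vector w_i per class (illustrative)
    w0 = np.array([0.0, 0.1, 0.2])   # one bias w_i0 per class (illustrative)

    def classify(x):
        # y_i(x) = w_i^T x + w_i0 for every class; pick the class with the largest y_i
        scores = W @ x + w0
        return int(np.argmax(scores))

    print(classify(np.array([2.0, -1.0])))   # class 0 has the largest score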
