Gaussian and Linear Discriminant Analysis; Multiclass Classification
Professor Ameet Talwalkar
CS260 Machine Learning Algorithms
January 30, 2017



Outline

1 Administration

2 Review of last lecture

3 Generative versus discriminative

4 Multiclass classification


Announcements

Homework 2: due on Wednesday


Outline

1 Administration

2 Review of last lecture
   Logistic regression

3 Generative versus discriminative

4 Multiclass classification


Logistic classification

Setup for two classes

Input: x ∈ R^D

Output: y ∈ {0, 1}

Training data: D = {(xn, yn), n = 1, 2, . . . , N}

Model of conditional distribution

p(y = 1|x; b, w) = σ[g(x)]

where g(x) = b + ∑d wd xd = b + wᵀx


Why the sigmoid function?

What does it look like?

σ(a) = 1 / (1 + e^(−a)),   where a = b + wᵀx

[Plot: the sigmoid curve σ(a) for a ∈ (−6, 6), rising monotonically from 0 to 1]

Properties

Bounded between 0 and 1 ← thus, interpretable as a probability

Monotonically increasing ← thus, usable to derive classification rules
  - σ(a) > 0.5: positive (classify as '1')
  - σ(a) < 0.5: negative (classify as '0')
  - σ(a) = 0.5: undecidable

Nice computational properties ← the derivative has a simple form

Linear or nonlinear classifier?
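To make the model concrete, here is a minimal NumPy sketch (not part of the original slides) of the sigmoid and the resulting classification rule; the function names sigmoid and predict, and the example numbers, are our own.

```python
import numpy as np

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, b):
    """Classify as '1' if sigma(b + w^T x) > 0.5, i.e. if b + w^T x > 0."""
    return int(sigmoid(b + w @ x) > 0.5)

# Example with a 2-dimensional input
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([0.3, 0.1])
print(sigmoid(b + w @ x))   # probability of class 1
print(predict(x, w, b))     # hard label
```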


Linear or nonlinear?

σ(a) is nonlinear; however, the decision boundary is determined by

σ(a) = 0.5 ⇒ a = 0 ⇒ g(x) = b + wᵀx = 0

which is a linear function in x

We often call b the offset term.


Likelihood function

Probability of a single training sample (xn, yn)

p(yn|xn; b, w) = σ(b + wᵀxn)        if yn = 1
                 1 − σ(b + wᵀxn)    otherwise

Compact expression, exploiting that yn is either 1 or 0:

p(yn|xn; b, w) = σ(b + wᵀxn)^(yn) [1 − σ(b + wᵀxn)]^(1−yn)
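As a quick illustration (our own snippet), the case-by-case definition and the compact product form agree for both values of yn:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def likelihood_branch(y, x, b, w):
    """Case-by-case definition of p(y | x; b, w)."""
    p1 = sigmoid(b + w @ x)
    return p1 if y == 1 else 1.0 - p1

def likelihood_compact(y, x, b, w):
    """Compact form sigma(.)^y * (1 - sigma(.))^(1 - y)."""
    p1 = sigmoid(b + w @ x)
    return p1**y * (1.0 - p1)**(1 - y)

w, b = np.array([0.5, -1.0]), 0.2
x = np.array([1.0, 2.0])
for y in (0, 1):
    assert np.isclose(likelihood_branch(y, x, b, w),
                      likelihood_compact(y, x, b, w))
```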


Maximum likelihood estimation

Cross-entropy error (negative log-likelihood)

E(b, w) = −∑n { yn log σ(b + wᵀxn) + (1 − yn) log[1 − σ(b + wᵀxn)] }

Numerical optimization

Gradient descent: simple, scalable to large-scale problems

Newton method: fast but not scalable


Numerical optimization

Gradient descent

Choose a proper step size η > 0

Iteratively update the parameters following the negative gradient to minimize the error function

w(t+1) ← w(t) − η ∑n [σ(wᵀxn) − yn] xn

Remarks

The gradient is the direction of steepest ascent, so we step along its negative.

The step size needs to be chosen carefully to ensure convergence.

The step size can be adaptive (i.e. varying from iteration to iteration).

Variant called stochastic gradient descent (later this quarter).
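A minimal batch gradient-descent loop implementing this update (our own sketch; the offset b is absorbed into w by appending a constant feature, and the step size and iteration count are arbitrary choices for the toy data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, eta=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy error.

    X: N x D feature matrix, y: length-N array of 0/1 labels.
    The offset b is handled by appending a constant-1 column to X.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])     # absorb b into w
    w = np.zeros(Xb.shape[1])
    for _ in range(num_iters):
        grad = Xb.T @ (sigmoid(Xb @ w) - y)        # sum_n (sigma(w^T x_n) - y_n) x_n
        w = w - eta * grad
    return w[:-1], w[-1]                           # (w, b)

# Toy 1-D example
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic_gd(X, y)
print(w, b)
```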


Intuition for Newton’s method

Approximate the true function with an easy-to-solve optimization problem

[Figure: a function f(x) and its quadratic approximation f_quad(x) around the current iterate xk; the next iterate xk + dk minimizes the quadratic]

In particular, we can approximate the cross-entropy error function around w(t) by a quadratic function (its second-order Taylor expansion), and then minimize this quadratic function


Update Rules

Gradient descent

w(t+1) ← w(t) − η ∑n [σ(wᵀxn) − yn] xn

Newton method

w(t+1) ← w(t) − [H(t)]⁻¹ ∇E(w(t))
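For logistic regression the standard Hessian of the cross-entropy error is H = ∑n σn(1 − σn) xn xnᵀ; a sketch of one Newton step under that assumption (ours, using np.linalg.solve rather than an explicit inverse):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, X, y):
    """One Newton update w <- w - H^{-1} grad E(w) for logistic regression.

    Gradient: X^T (sigma - y); Hessian: X^T diag(sigma * (1 - sigma)) X.
    """
    s = sigmoid(X @ w)
    grad = X.T @ (s - y)
    H = X.T @ np.diag(s * (1.0 - s)) @ X
    return w - np.linalg.solve(H, grad)

# Toy usage (offset absorbed as a constant feature)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(5):
    w = newton_step(w, X, y)
print(w)
```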


Contrast gradient descent and Newton’s method

Similar

Both are iterative procedures.

Different

Newton's method requires second-order derivatives (less scalable, but faster convergence)

Newton's method does not have a magic step size η to set


Outline

1 Administration

2 Review of last lecture

3 Generative versus discriminative
   Contrast Naive Bayes and logistic regression
   Gaussian and Linear Discriminant Analysis

4 Multiclass classification


Naive Bayes and logistic regression: two different modelling paradigms

Consider spam classification problem

First strategy:
  - Use the training set to find a decision boundary in the feature space that separates spam and non-spam emails
  - Given a test point, predict its label based on which side of the boundary it is on

Second strategy:
  - Look at spam emails and build a model of what they look like; similarly, build a model of what non-spam emails look like
  - To classify a new email, match it against both the spam and non-spam models to see which is the better fit

The first strategy is discriminative (e.g., logistic regression)
The second strategy is generative (e.g., naive Bayes)


Generative vs Discriminative

Discriminative

Requires only specifying a model for the conditional distribution p(y|x), and thus maximizes the conditional likelihood ∑n log p(yn|xn)

Models that try to learn mappings directly from feature space to the labels are also discriminative, e.g., perceptron, SVMs (covered later)

Generative

Aims to model the joint probability p(x, y) and thus maximize the joint likelihood ∑n log p(xn, yn)

The generative models we'll cover do so by modeling p(x|y) and p(y)

Let's look at two more examples: Gaussian (or Quadratic) Discriminant Analysis and Linear Discriminant Analysis


Determining sex based on measurements

[Scatter plot of weight vs. height; red = female, blue = male]


Generative approach

Model joint distribution of (x = (height, weight), y = sex)

our data:

Sex   Height   Weight
 1    6′       175
 0    5′2″     120
 1    5′6″     140
 1    6′2″     240
 0    5′7″     130
 ···  ···      ···

[Scatter plot of weight vs. height; red = female, blue = male]

Intuition: we will model how heights vary (according to a Gaussian) in each sub-population (male and female).


Model of the joint distribution (1D)

p(x, y) = p(y) p(x|y)
        = p0 · (1/(√(2π) σ0)) · e^(−(x − µ0)² / (2σ0²))   if y = 0
          p1 · (1/(√(2π) σ1)) · e^(−(x − µ1)² / (2σ1²))   if y = 1

p0 + p1 = 1 are prior probabilities, and p(x|y) is a class conditional distribution

[Scatter plot of weight vs. height; red = female, blue = male]

What are the parameters to learn?


Parameter estimation

Log-likelihood of training data D = {(xn, yn)}, n = 1, . . . , N, with yn ∈ {0, 1}:

log P(D) = ∑n log p(xn, yn)
         = ∑{n: yn=0} log[ p0 · (1/(√(2π) σ0)) · e^(−(xn − µ0)² / (2σ0²)) ]
         + ∑{n: yn=1} log[ p1 · (1/(√(2π) σ1)) · e^(−(xn − µ1)² / (2σ1²)) ]

Max log-likelihood: (p0*, p1*, µ0*, µ1*, σ0*, σ1*) = argmax log P(D)

Max likelihood (in D = 2 dimensions): (p0*, p1*, µ0*, µ1*, Σ0*, Σ1*) = argmax log P(D)

For Naive Bayes we assume Σi* is diagonal
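A direct transcription of this log-likelihood for the 1-D case (our own sketch; the heights and parameter values are made up):

```python
import numpy as np

def gda_log_likelihood(x, y, p, mu, sigma):
    """log P(D) = sum_n [ log p_{y_n} + log N(x_n; mu_{y_n}, sigma_{y_n}^2) ].

    x, y: 1-D data and 0/1 labels; p, mu, sigma: length-2 arrays of
    class priors, class means, and class standard deviations.
    """
    p_, mu_, s_ = p[y], mu[y], sigma[y]            # pick the parameters of each point's class
    log_gauss = -0.5 * np.log(2 * np.pi * s_**2) - (x - mu_)**2 / (2 * s_**2)
    return np.sum(np.log(p_) + log_gauss)

# Heights (inches) with labels 0 = female, 1 = male (made-up numbers)
x = np.array([62.0, 66.0, 70.0, 74.0, 63.0])
y = np.array([0, 0, 1, 1, 0])
print(gda_log_likelihood(x, y, p=np.array([0.6, 0.4]),
                         mu=np.array([64.0, 72.0]),
                         sigma=np.array([2.5, 3.0])))
```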


Decision boundary

As before, the Bayes optimal one under the assumed joint distribution depends on

p(y = 1|x) ≥ p(y = 0|x)

which is equivalent to

p(x|y = 1)p(y = 1) ≥ p(x|y = 0)p(y = 0)

Namely,

−(x − µ1)² / (2σ1²) − log(√(2π) σ1) + log p1 ≥ −(x − µ0)² / (2σ0²) − log(√(2π) σ0) + log p0

⇒ ax² + bx + c ≥ 0 ← the decision boundary is not linear!
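Expanding both sides and collecting powers of x gives explicit coefficients a, b, c; the small sketch below (ours) computes them and shows that the quadratic term a vanishes when σ0 = σ1, which is the linear special case discussed later.

```python
import numpy as np

def boundary_coeffs(p, mu, sigma):
    """Coefficients of a x^2 + b x + c >= 0 (classify as 1 when satisfied),
    obtained by expanding the two Gaussian class terms above."""
    p0, p1 = p
    mu0, mu1 = mu
    s0, s1 = sigma
    a = 1.0 / (2 * s0**2) - 1.0 / (2 * s1**2)
    b = mu1 / s1**2 - mu0 / s0**2
    c = (mu0**2 / (2 * s0**2) - mu1**2 / (2 * s1**2)
         + np.log(s0 / s1) + np.log(p1 / p0))
    return a, b, c

# Unequal variances: genuinely quadratic boundary
print(boundary_coeffs(p=(0.5, 0.5), mu=(64.0, 72.0), sigma=(2.5, 3.5)))
# Equal variances: a == 0, so the boundary is linear (the LDA special case)
print(boundary_coeffs(p=(0.5, 0.5), mu=(64.0, 72.0), sigma=(3.0, 3.0)))
```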


Example of nonlinear decision boundary

[Figure: example decision boundary with a parabolic shape]

Note: the boundary is characterized by a quadratic function, giving rise to the shape of a parabolic curve.


A special case: what if we assume the two Gaussians have the same variance?

−(x − µ1)² / (2σ1²) − log(√(2π) σ1) + log p1 ≥ −(x − µ0)² / (2σ0²) − log(√(2π) σ0) + log p0

with σ0 = σ1.

We get a linear decision boundary: bx + c ≥ 0

Note: equal variances across two different categories could be a very strong assumption.

[Scatter plot of weight vs. height; red = female, blue = male]

For example, from the plot, it does seem that the male population has slightly bigger variance (i.e., a bigger ellipse) than the female population. So the assumption might not be applicable.


Mini-summary

Gaussian discriminant analysis

A generative approach, assuming the data are modeled by

p(x, y) = p(y)p(x|y)

where p(x|y) is a Gaussian distribution.

Parameters (of Gaussian distributions) estimated by max likelihood

Decision boundary
  - In general, a nonlinear function of x (quadratic discriminant analysis)
  - Linear under various assumptions about the Gaussian covariance matrices:
      - a single arbitrary matrix shared across classes (linear discriminant analysis)
      - multiple diagonal matrices (Gaussian Naive Bayes (GNB))
      - a single diagonal matrix (GNB in HW2 Problem 1)


So what is the discriminative counterpart?

Intuition: the decision boundary in Gaussian discriminant analysis is

ax² + bx + c = 0

Let us model the conditional distribution analogously:

p(y|x) = σ[ax² + bx + c] = 1 / (1 + e^(−(ax² + bx + c)))

Or, even simpler, going after the decision boundary of linear discriminant analysis:

p(y|x) = σ[bx + c]

Both look very similar to logistic regression, i.e., we focus on writing down the conditional probability, not the joint probability.


Does this change how we estimate the parameters?

First change: a smaller number of parameters to estimate

The model is only parameterized by a, b and c. There are no prior probabilities (p0, p1) or Gaussian distribution parameters (µ0, µ1, σ0 and σ1).

Second change: maximize the conditional likelihood p(y|x)

(a*, b*, c*) = argmin −∑n { yn log σ(a xn² + b xn + c) + (1 − yn) log[1 − σ(a xn² + b xn + c)] }

No closed-form solutions!


How easy is this for our Gaussian discriminant analysis?

Example

p1 = (# of training samples in class 1) / (# of training samples)

µ1 = ( ∑{n: yn=1} xn ) / (# of training samples in class 1)

σ1² = ( ∑{n: yn=1} (xn − µ1)² ) / (# of training samples in class 1)

Note: see textbook for detailed derivation (including generalization to higher dimensions and multiple classes)
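These estimates are just class counts and per-class sample means and variances; a minimal sketch of the two-class, 1-D case (ours, with made-up data):

```python
import numpy as np

def fit_gda_1d(x, y):
    """Closed-form maximum-likelihood estimates for the 1-D, two-class model."""
    p, mu, var = np.zeros(2), np.zeros(2), np.zeros(2)
    for k in (0, 1):
        xk = x[y == k]
        p[k] = len(xk) / len(x)            # class prior
        mu[k] = xk.mean()                  # class mean
        var[k] = ((xk - mu[k])**2).mean()  # class variance (MLE divides by N_k)
    return p, mu, var

# Toy heights (inches), 0 = female, 1 = male (made-up numbers)
x = np.array([62.0, 63.0, 66.0, 70.0, 72.0, 74.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_gda_1d(x, y))
```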


Generative versus discriminative: which one to use?

There is no fixed rule

Selecting which type of method to use is dataset/task specific

It depends on how well your modeling assumption fits the data

For instance, as we show in HW2, when data follows a specific variant of the Gaussian Naive Bayes assumption, p(y|x) necessarily follows a logistic function. However, the converse is not true.
  - Gaussian Naive Bayes makes a stronger assumption than logistic regression
  - When data follows this assumption, Gaussian Naive Bayes will likely yield a model that better fits the data
  - But logistic regression is more robust and less sensitive to incorrect modelling assumptions


Outline

1 Administration

2 Review of last lecture

3 Generative versus discriminative

4 Multiclass classification
   Use binary classifiers as building blocks
   Multinomial logistic regression


Setup

Predict multiple classes/outcomes: C1, C2, . . . , CK

Weather prediction: sunny, cloudy, raining, etc

Optical character recognition: 10 digits + 26 characters (lower and upper cases) + special characters, etc

Studied methods

Nearest neighbor classifier

Naive Bayes

Gaussian discriminant analysis

Logistic regression


Logistic regression for predicting multiple classes? Easy

The approach of “one versus the rest”

For each class Ck, change the problem into binary classification:
  1. Relabel training data with label Ck as positive (or '1')
  2. Relabel all the remaining data as negative (or '0')

This step is often called 1-of-K encoding, i.e., only one label is nonzero and everything else is zero. Example: for class C2, the data go through the following change:

(x1, C1) → (x1, 0), (x2, C3) → (x2, 0), . . . , (xn, C2) → (xn, 1), . . .

Train K binary classifiers using logistic regression to differentiate the two classes

When predicting on x, combine the outputs of all binary classifiers:
  1. What if all the classifiers say negative?
  2. What if multiple classifiers say positive?
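A sketch of the relabel/train/combine recipe above (our own illustration, not the course's reference code): it reuses a plain gradient-descent logistic trainer and resolves the "all negative" / "several positive" cases by picking the class whose classifier reports the largest probability.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.1, num_iters=2000):
    """Plain logistic regression via gradient descent (offset as a constant feature)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(num_iters):
        w -= eta * Xb.T @ (sigmoid(Xb @ w) - y)
    return w

def fit_one_vs_rest(X, y, num_classes):
    """Train K binary classifiers, one per class (class k vs. everything else)."""
    return [fit_binary(X, (y == k).astype(float)) for k in range(num_classes)]

def predict_one_vs_rest(models, x):
    """Break ties by taking the largest sigma(w_k^T x) over the K classifiers."""
    xb = np.append(x, 1.0)
    scores = [sigmoid(w @ xb) for w in models]
    return int(np.argmax(scores))

# Toy 1-D, 3-class data
X = np.array([[-3.0], [-2.5], [0.0], [0.4], [2.8], [3.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_rest(X, y, num_classes=3)
print([predict_one_vs_rest(models, x) for x in X])
```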


Yet another easy approach

The approach of “one versus one”

For each pair of classes Ck and Ck′, change the problem into binary classification:
  1. Relabel training data with label Ck as positive (or '1')
  2. Relabel training data with label Ck′ as negative (or '0')
  3. Disregard all other data

Ex: for classes C1 and C2,

(x1, C1), (x2, C3), (x3, C2), . . . → (x1, 1), (x3, 0), . . .

Train K(K − 1)/2 binary classifiers using logistic regression to differentiate the two classes

When predicting on x, combine the outputs of all binary classifiers: there are K(K − 1)/2 votes!
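The pairwise variant can be sketched the same way (again our own illustration), with prediction by majority vote over the K(K − 1)/2 classifiers:

```python
import numpy as np
from itertools import combinations

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.1, num_iters=2000):
    """Plain logistic regression via gradient descent (offset as a constant feature)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(num_iters):
        w -= eta * Xb.T @ (sigmoid(Xb @ w) - y)
    return w

def fit_one_vs_one(X, y, num_classes):
    """Train one classifier per pair (k, k'), using only the data from those two classes."""
    models = {}
    for k, kp in combinations(range(num_classes), 2):
        mask = (y == k) | (y == kp)
        models[(k, kp)] = fit_binary(X[mask], (y[mask] == k).astype(float))
    return models

def predict_one_vs_one(models, x):
    """Each pairwise classifier casts one vote; predict the class with the most votes."""
    xb = np.append(x, 1.0)
    votes = np.zeros(max(max(pair) for pair in models) + 1)
    for (k, kp), w in models.items():
        votes[k if sigmoid(w @ xb) > 0.5 else kp] += 1
    return int(np.argmax(votes))

X = np.array([[-3.0], [-2.5], [0.0], [0.4], [2.8], [3.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_one(X, y, num_classes=3)
print([predict_one_vs_one(models, x) for x in X])
```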


Contrast these two approaches

Pros of each approach

one versus the rest: only needs to train K classifiers
  - Makes a big difference if you have a lot of classes to go through

one versus one: each classifier only needs to train on a smaller subset of the data (only the points labeled with those two classes are involved)
  - Makes a big difference if you have a lot of data to go through

Bad about both of them: combining the classifiers' outputs seems to be a bit tricky.

Any other good methods?


Multinomial logistic regression

Intuition: from the decision rule of our naive Bayes classifier

y* = argmax_k p(y = Ck|x) = argmax_k log [ p(x|y = Ck) p(y = Ck) ]
   = argmax_k [ log πk + ∑i zi log θki ] = argmax_k wkᵀx

Essentially, we are comparing

w1ᵀx, w2ᵀx, · · · , wKᵀx

with one for each category.


First try

So, can we define the following conditional model?

p(y = Ck|x) = σ[wkᵀx]

This would not work because

∑k p(y = Ck|x) = ∑k σ[wkᵀx] ≠ 1

as each summand can be any number (independently) between 0 and 1. But we are close!

We can learn the K linear models jointly to ensure this property holds!


Definition of multinomial logistic regression

Model

For each class Ck, we have a parameter vector wk and model the posterior probability as

p(Ck|x) = e^(wkᵀx) / ∑k′ e^(wk′ᵀx)   ← this is called the softmax function

Decision boundary: assign x the label that maximizes the posterior

argmax_k P(Ck|x) → argmax_k wkᵀx


How does the softmax function behave?

Suppose we have

w1ᵀx = 100,  w2ᵀx = 50,  w3ᵀx = −20

We would pick the winning class label 1.

Softmax translates these scores into well-formed conditional probabilities

p(y = 1|x) = e^100 / (e^100 + e^50 + e^(−20)) < 1

preserves relative ordering of scores

maps scores to values between 0 and 1 that also sum to 1
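A direct computation with these scores (our own snippet). Subtracting the maximum score before exponentiating is a standard numerical trick that leaves the softmax output unchanged:

```python
import numpy as np

def softmax(scores):
    """Map the scores w_k^T x to probabilities that are positive and sum to 1."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()              # shifting by a constant cancels out; avoids overflow
    e = np.exp(z)
    return e / e.sum()

scores = [100.0, 50.0, -20.0]    # w_1^T x, w_2^T x, w_3^T x from the slide
probs = softmax(scores)
print(probs)                     # p(y = 1|x) is extremely close to, but less than, 1
print(probs.sum())               # sums to 1 (up to floating point)
print(np.argmax(scores) == np.argmax(probs))   # relative ordering is preserved
```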


Sanity check

The multinomial model reduces to binary logistic regression when K = 2:

p(C1|x) = e^(w1ᵀx) / (e^(w1ᵀx) + e^(w2ᵀx)) = 1 / (1 + e^(−(w1 − w2)ᵀx)) = 1 / (1 + e^(−wᵀx))

where w = w1 − w2. The multinomial model thus generalizes (binary) logistic regression to deal with multiple classes.
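A quick numeric check of this reduction (ours): for K = 2, the softmax probability of class 1 equals the sigmoid of (w1 − w2)ᵀx.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(scores):
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
w1, w2, x = rng.normal(size=(3, 4))        # two weight vectors and one input, all random
p_softmax = softmax([w1 @ x, w2 @ x])[0]   # two-class multinomial (softmax) model
p_sigmoid = sigmoid((w1 - w2) @ x)         # binary logistic regression with w = w1 - w2
print(np.isclose(p_softmax, p_sigmoid))    # True
```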


Parameter estimation

Discriminative approach: maximize conditional likelihood

log P(D) = ∑n log P(yn|xn)

We will change yn to yn = [yn1 yn2 · · · ynK]ᵀ, a K-dimensional vector using 1-of-K encoding:

ynk = 1 if yn = k
      0 otherwise

Ex: if yn = 2, then yn = [0 1 0 0 · · · 0]ᵀ.

⇒ ∑n log P(yn|xn) = ∑n log ∏k P(Ck|xn)^(ynk) = ∑n ∑k ynk log P(Ck|xn)


Cross-entropy error function

Definition: negative log likelihood

E(w1, w2, . . . , wK) = −∑n ∑k ynk log P(Ck|xn)

Properties

Convex, therefore any local optimum is a global optimum

Optimization requires numerical procedures, analogous to those usedfor binary logistic regression
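Putting the pieces together, a compact gradient-descent sketch (ours) for minimizing this cross-entropy error; it uses the standard gradient ∑n (P(Ck|xn) − ynk) xn with respect to wk, which mirrors the binary case. The step size and toy data are arbitrary.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of an N x K score matrix."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial_lr(X, y, num_classes, eta=0.05, num_iters=3000):
    """Minimize E(w_1, ..., w_K) = -sum_n sum_k y_nk log P(C_k|x_n) by gradient descent.

    Gradient w.r.t. w_k: sum_n (P(C_k|x_n) - y_nk) x_n (offset absorbed as a constant feature).
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Y = np.eye(num_classes)[y]                  # 1-of-K (one-hot) encoding of the labels
    W = np.zeros((Xb.shape[1], num_classes))    # one column w_k per class
    for _ in range(num_iters):
        P = softmax_rows(Xb @ W)                # N x K matrix of posteriors P(C_k|x_n)
        W -= eta * Xb.T @ (P - Y)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xb @ W, axis=1)            # argmax_k w_k^T x

# Toy 1-D, 3-class data
X = np.array([[-3.0], [-2.5], [0.0], [0.4], [2.8], [3.2]])
y = np.array([0, 0, 1, 1, 2, 2])
W = fit_multinomial_lr(X, y, num_classes=3)
print(predict(W, X))
```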
