Machine learning: lecture 3 Tommi S. Jaakkola MIT AI Lab


Page 1

Machine learning: lecture 3

Tommi S. Jaakkola

MIT AI Lab

Page 2

Topics

• Linear regression

– overfitting, cross-validation

• Additive models

– polynomial regression, other basis functions

• Statistical view of regression

– noise model

– likelihood, maximum likelihood estimation

– limitations

Tommi Jaakkola, MIT AI Lab 2

Page 3

Review: generalization

• The “generalization” error

E_{(x,y)∼P}{ (y − w0 − w1x)² }

is a sum of two terms:

1. the error of the best predictor in the class

   E_{(x,y)∼P}{ (y − w0* − w1*x)² } = min_{w0,w1} E_{(x,y)∼P}{ (y − w0 − w1x)² }

2. and how well we approximate the best linear predictor based on a limited training set

   E_{(x,y)∼P}{ ((w0* + w1*x) − (ŵ0 + ŵ1x))² }

   where (ŵ0, ŵ1) are the estimates obtained from the training set.

Tommi Jaakkola, MIT AI Lab 3

Page 4

Overfitting

• With too few training examples our linear regression model may achieve zero training error but nevertheless have a large generalization error

[Figure: a linear fit to a handful of training points; x ∈ [−2, 2], y ∈ [−6, 6]]

When the training error no longer bears any relation to the generalization error, the model overfits the data.

Tommi Jaakkola, MIT AI Lab 4

Page 5

Cross-validation

• Cross-validation allows us to estimate generalization error on

the basis of only the training set

For example, the leave-one-out cross-validation error is given by

CV = (1/n) Σ_{i=1}^n ( yi − (w0^{−i} + w1^{−i} xi) )²

where (w0^{−i}, w1^{−i}) are the least squares estimates computed without the ith training example.
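As an illustration, here is a minimal numpy sketch of this leave-one-out score for the simple linear model; the data arrays x and y are assumed to be one-dimensional numpy arrays (not from the lecture itself).

```python
import numpy as np

def loo_cv_error(x, y):
    """Leave-one-out cross-validation error for y ≈ w0 + w1*x."""
    n = len(x)
    errors = []
    for i in range(n):
        # Drop the i-th example and refit the line by least squares.
        mask = np.arange(n) != i
        X = np.column_stack([np.ones(n - 1), x[mask]])
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        pred_i = w[0] + w[1] * x[i]
        errors.append((y[i] - pred_i) ** 2)
    return np.mean(errors)
```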

Tommi Jaakkola, MIT AI Lab 5

Page 6

Extensions of linear regression: additive models

• Our previous results generalize to models that are linear in the parameters w, not necessarily in the inputs x:

1. Simple linear prediction f : R → R

   f(x; w) = w0 + w1x

2. mth order polynomial prediction f : R → R

   f(x; w) = w0 + w1x + . . . + w_{m−1}x^{m−1} + w_m x^m

3. Multi-dimensional linear prediction f : R^d → R

   f(x; w) = w0 + w1x1 + . . . + w_{d−1}x_{d−1} + w_d x_d

   where x = [x1 . . . x_{d−1} x_d]^T, d = dim(x)
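A minimal sketch of how such models can be fit in practice: build a design matrix whose columns are the (possibly nonlinear) functions of x, then solve ordinary least squares for w. The function names below are illustrative, not from the lecture.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Fit f(x; w) = w0 + w1*x + ... + w_degree*x**degree by least squares."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_polynomial(w, x):
    """Evaluate the fitted polynomial at the points x."""
    X = np.vander(x, N=len(w), increasing=True)
    return X @ w
```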

Tommi Jaakkola, MIT AI Lab 6

Page 7

Polynomial regression: example

[Figure: polynomial fits of degree 1, 3, 5, and 7 to the same training data; x ∈ [−2, 2]]

Tommi Jaakkola, MIT AI Lab 7

Page 8

Polynomial regression: example cont’d

[Figure: the same polynomial fits, now with their leave-one-out scores:
degree = 1, CV = 1.1    degree = 3, CV = 2.6
degree = 5, CV = 44.2   degree = 7, CV = 482.0]

Tommi Jaakkola, MIT AI Lab 8

Page 9

Additive models cont’d

• More generally, predictions are based on a linear combination of basis functions (features) {φ1(x), . . . , φm(x)}, where each φi : R^d → R, and

f(x; w) = w0 + w1φ1(x) + . . . + w_{m−1}φ_{m−1}(x) + w_mφ_m(x)

• For example:

If φi(x) = x^i, i = 1, . . . , m, then

f(x; w) = w0 + w1x + . . . + w_{m−1}x^{m−1} + w_m x^m

If m = d, φi(x) = xi, i = 1, . . . , d, then

f(x; w) = w0 + w1x1 + . . . + w_{d−1}x_{d−1} + w_d x_d
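The same least-squares machinery handles any fixed set of basis functions; a sketch under the assumption that the list of feature functions is supplied by the user:

```python
import numpy as np

def fit_additive(x_rows, y, basis_funcs):
    """Least-squares fit of f(x; w) = w0 + sum_i w_i * phi_i(x).

    x_rows: sequence of input vectors; basis_funcs: list of callables phi_i.
    """
    # Each row of the design matrix is [1, phi_1(x), ..., phi_m(x)].
    Phi = np.array([[1.0] + [phi(x) for phi in basis_funcs] for x in x_rows])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```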

Tommi Jaakkola, MIT AI Lab 9

Page 10

Additive models cont’d

• Example: it is often useful to find “prototypical” input vectors µ1, . . . , µm that exemplify different “contexts” for prediction

We can define basis functions (one for each prototype) that measure how close the input vector x is to the prototype:

φk(x) = exp{ −(1/2) ‖x − µk‖² }

[Figure: the resulting basis function responses over x ∈ [−2, 2]]
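A sketch of these prototype (Gaussian bump) features, reusable with a generic fitting routine such as the one above; the prototypes mu_k are assumed given, e.g. chosen by hand or by clustering:

```python
import numpy as np

def rbf_basis(prototypes):
    """One feature per prototype: phi_k(x) = exp(-0.5 * ||x - mu_k||^2)."""
    # The mu=mu default argument freezes each prototype inside its lambda.
    return [lambda x, mu=mu: np.exp(-0.5 * np.sum((np.asarray(x) - mu) ** 2))
            for mu in prototypes]
```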

Tommi Jaakkola, MIT AI Lab 10

Page 11

Additive models cont’d

• The basis functions can capture various (e.g., qualitative)

properties of the inputs.

For example: we can try to rate companies based on text

descriptions

x = text document (string of words)

φi(x) = 1 if word i appears in the document, 0 otherwise

f(x; w) = w0 + Σ_{i ∈ words} wi φi(x)
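A minimal sketch of such binary word-presence features and the resulting additive score; the vocabulary and helper names are stand-ins, not from the lecture:

```python
def word_features(vocabulary):
    """phi_i(x) = 1 if word i appears in document x (a string), else 0.

    Assumes a lowercase vocabulary and whitespace-separated words.
    """
    return [lambda doc, w=w: 1.0 if w in doc.lower().split() else 0.0
            for w in vocabulary]

def score(doc, w0, weights, features):
    """f(x; w) = w0 + sum_i w_i * phi_i(x)."""
    return w0 + sum(wi * phi(doc) for wi, phi in zip(weights, features))
```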

Tommi Jaakkola, MIT AI Lab 11

Page 12

Additive models cont’d

• Graphical representation of additive models (cf. neural

networks):

[Figure: network diagram with inputs x1, x2, a constant input 1, basis functions φ1(x), . . . , φm(x), and weights w0, w1, . . . , wm combining them into the output f(x; w)]

Tommi Jaakkola, MIT AI Lab 12

Page 13

Statistical view of linear regression

• A statistical regression model

Observed output = function + noise

y = f(x; w) + ε

where, e.g., ε ∼ N(0, σ2).

• Whatever we cannot capture with our chosen family of

functions will be interpreted as noise

Tommi Jaakkola, MIT AI Lab 13

Page 14

Statistical view of linear regression

• Our function f(x; w) here is trying to capture the mean of

the observations y given a specific input x:

E{ y |x } = f(x; w)

The expectation is taken with respect to P that governs the

underlying (and typically unknown) relation between x and

y.


Tommi Jaakkola, MIT AI Lab 14

Page 15

Statistical view of linear regression

• According to our statistical model

y = f(x; w) + ε, ε ∼ N(0, σ2)

the outputs y given x are normally distributed with mean

f(x; w) and variance σ2:

P(y|x, w, σ²) = (1/√(2πσ²)) exp{ −(1/(2σ²)) (y − f(x; w))² }

• As a result we can also measure the uncertainty in the

predictions (through variance σ2), not just the mean
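A small sketch of evaluating this conditional density, and hence a prediction with an uncertainty estimate; f, w, and sigma2 are placeholders for the fitted predictor and noise variance:

```python
import numpy as np

def predictive_density(y, x, f, w, sigma2):
    """Gaussian density of y given x under the model y = f(x; w) + noise."""
    mean = f(x, w)
    return np.exp(-(y - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```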

• Loss function? Estimation?

Tommi Jaakkola, MIT AI Lab 15

Page 16

Maximum likelihood estimation

• Given observations D = {(x1, y1), . . . , (xn, yn)} we find the

parameters w that maximize the likelihood of the observed

outputs

L(D; w, σ²) = ∏_{i=1}^n P(yi | xi, w, σ²)

[Figure: a visibly poor linear fit to the training data; x ∈ [−2, 2], y ∈ [−6, 6]]

Why is this a bad fit according to the likelihood criterion?

Tommi Jaakkola, MIT AI Lab 16

Page 17

Maximum likelihood estimation

Likelihood of the observed outputs:

L(D; w, σ²) = ∏_{i=1}^n P(yi | xi, w, σ²)

• It is often easier (and equivalent) to try to maximize the

log-likelihood:

l(D; w, σ²) = log L(D; w, σ²) = Σ_{i=1}^n log P(yi | xi, w, σ²)

  = Σ_{i=1}^n ( −(1/(2σ²)) (yi − f(xi; w))² − log √(2πσ²) )

  = −(1/(2σ²)) Σ_{i=1}^n (yi − f(xi; w))² − (n/2) log(2πσ²)
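A direct numpy sketch of this log-likelihood for a given predictor; the argument names are placeholders:

```python
import numpy as np

def log_likelihood(x, y, f, w, sigma2):
    """l(D; w, sigma2) under the Gaussian noise model y = f(x; w) + eps."""
    residuals = y - np.array([f(xi, w) for xi in x])
    n = len(y)
    return (-np.sum(residuals ** 2) / (2 * sigma2)
            - 0.5 * n * np.log(2 * np.pi * sigma2))
```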

Tommi Jaakkola, MIT AI Lab 17

Page 18

Maximum likelihood estimation cont’d

• The noise distribution and the loss function are intricately related:

Loss(y, f(x; w)) = − log P(y | x, w, σ²) + const.

Tommi Jaakkola, MIT AI Lab 18

Page 19

Maximum likelihood estimation cont’d

• The likelihood of the observed outputs

L(D; w, σ²) = ∏_{i=1}^n P(yi | xi, w, σ²)

provides a general measure of how the model fits the data.

On the basis of this measure, we can estimate the noise

variance σ2 as well as the weights w.

Can we find a rationale for what the “optimal” noise variance

should be?

Tommi Jaakkola, MIT AI Lab 19

Page 20

Maximum likelihood estimation cont’d

• To estimate the parameters w and σ2 quantitatively, we

maximize the log-likelihood with respect to all the parameters

∂/∂w l(D; w, σ²) = 0,    ∂/∂σ² l(D; w, σ²) = 0

The resulting noise variance estimate is given by

σ̂² = (1/n) Σ_{i=1}^n (yi − f(xi; ŵ))²

where ŵ is the same ML estimate of w as before.

Interpretation: this is the mean squared prediction error (on

the training set) of the best linear predictor.
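A sketch combining the two maximizations for the linear-in-parameters case: ordinary least squares gives ŵ, and the ML noise variance is the mean squared training error. The design matrix is assumed to be built as in the earlier sketches.

```python
import numpy as np

def ml_fit(Phi, y):
    """Maximum likelihood fit under y = Phi @ w + Gaussian noise.

    Phi: (n, k) design matrix (first column all ones), y: (n,) outputs.
    """
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residuals = y - Phi @ w_hat
    sigma2_hat = np.mean(residuals ** 2)   # mean squared training error
    return w_hat, sigma2_hat
```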

Tommi Jaakkola, MIT AI Lab 20

Page 21

Brief derivation

Consider the log-likelihood evaluated at ŵ

l(D; ŵ, σ²) = −(1/(2σ²)) Σ_{i=1}^n (yi − f(xi; ŵ))² − (n/2) log(2πσ²)

(we need to justify first that we can simply substitute in the ML solution ŵ rather than perform the joint maximization)

Now,

∂/∂σ² l(D; ŵ, σ²) = (1/(2σ⁴)) Σ_{i=1}^n (yi − f(xi; ŵ))² − n/(2σ²) = 0

and we get the solution by multiplying both sides by 2σ⁴/n.

Tommi Jaakkola, MIT AI Lab 21

Page 22

Cross-validation and log-likelihood

Leave-one-out cross-validated log-likelihood:

CV = Σ_{i=1}^n log P(yi | xi, ŵ^{−i}, (σ²)^{−i})

where ŵ^{−i} and (σ²)^{−i} are the maximum likelihood estimates computed without the ith training example (xi, yi).
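A sketch of this score, reusing the ml_fit routine from the earlier sketch; again only an illustrative outline:

```python
import numpy as np

def loo_cv_log_likelihood(Phi, y):
    """Leave-one-out cross-validated log-likelihood for the Gaussian linear model."""
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w_hat, sigma2_hat = ml_fit(Phi[mask], y[mask])   # refit without example i
        mean_i = Phi[i] @ w_hat
        total += (-(y[i] - mean_i) ** 2 / (2 * sigma2_hat)
                  - 0.5 * np.log(2 * np.pi * sigma2_hat))
    return total
```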

Tommi Jaakkola, MIT AI Lab 22

Page 23

Some limitations

• The simple statistical model

y = f(x; w) + ε, ε ∼ N(0, σ2)

is not always appropriate or useful.

Example: noise may not be Gaussian

[Figure: data over x ∈ [−2, 2] together with an empirical noise distribution that is clearly non-Gaussian]

Tommi Jaakkola, MIT AI Lab 23

Page 24

Limitations cont’d

• It may not even be possible (or at all useful) to model the

data with

y = f(x; w) + ε, ε ∼ N(0, σ2)

no matter how flexible the function class f(·; w), w ∈ W is.

Example:

[Figure: example data with x, y ∈ [−1.5, 1.5] illustrating this point]

(note: this is NOT a limitation of conditional models P(y|x, w) more generally)

Tommi Jaakkola, MIT AI Lab 24