
Page 1: Machine Learning CS 6830 - webpages.uncc.edu

Machine Learning CS 6830

Razvan C. Bunescu

School of Electrical Engineering and Computer Science

[email protected]

Lecture 02

Page 2: Machine Learning CS 6830 - webpages.uncc.edu

Supervised Learning

•  Task = learn a function y : X → T that maps input instances x ∈ X to output targets t ∈ T:
  –  Classification: the output t ∈ T is one of a finite set of discrete categories.
  –  Regression: the output t ∈ T is continuous, or has a continuous component.

•  Supervision = a set of training examples: (x_1, t_1), (x_2, t_2), …, (x_n, t_n)


Page 3: Machine Learning CS 6830 - webpages.uncc.edu

Regression: Curve Fitting

•  Training: examples (x1,t1), (x2,t2), … (xn,tn)


Page 4: Machine Learning CS 6830 - webpages.uncc.edu

Regression: Curve Fitting

•  Testing: for an arbitrary (unseen) instance x ∈ X, compute the target output y(x) = t ∈ T.


Page 5: Machine Learning CS 6830 - webpages.uncc.edu

Polynomial Curve Fitting

y(x) = y(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = ∑_{j=0}^{M} w_j x^j

(the coefficients w_0, …, w_M are the parameters; the powers x^j are the features)
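To make the model concrete, here is a minimal sketch in Python/NumPy (code is not part of the original slides); the helper names and the coefficient values are made up for illustration.

```python
import numpy as np

def poly_features(x, M):
    """Feature vector [1, x, x^2, ..., x^M] for a scalar input x."""
    return np.array([x ** j for j in range(M + 1)])

def y(x, w):
    """Evaluate the polynomial model y(x, w) = sum_{j=0}^{M} w_j * x^j."""
    M = len(w) - 1
    return w @ poly_features(x, M)

# Example with illustrative (made-up) coefficients w_0..w_3:
w = np.array([0.5, -1.0, 0.0, 2.0])
print(y(1.5, w))   # 0.5 - 1.0*1.5 + 0.0*1.5^2 + 2.0*1.5^3
```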

Page 6: Machine Learning CS 6830 - webpages.uncc.edu

Polynomial Curve Fitting

•  Learning = finding the "right" parameters w^T = [w_0, w_1, …, w_M]:
  –  Find w that minimizes an error function E(w), which measures the misfit between y(x_n, w) and t_n.
  –  Expect that if y(x, w) performs well on the training examples x_n, then y(x, w) will perform well on arbitrary test examples x ∈ X (the Inductive Learning Hypothesis).

•  Sum-of-Squares error function:

   E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2

   Why squared?

Page 7: Machine Learning CS 6830 - webpages.uncc.edu

Sum-of-Squares Error Function

•  How do we find w* that minimizes E(w)?

E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2

w* = argmin_w E(w)

Page 8: Machine Learning CS 6830 - webpages.uncc.edu

Polynomial Curve Fitting

•  The least-squares solution is found by solving a set of M + 1 linear equations:

   ∑_{j=0}^{M} A_{ij} w_j = T_i,   where   A_{ij} = ∑_{n=1}^{N} (x_n)^{i+j}   and   T_i = ∑_{n=1}^{N} (x_n)^{i} t_n

•  Generalization = how well the parameterized y(x, w*) performs on arbitrary (unseen) test instances x ∈ X.
  –  Generalization performance depends on the value of M.
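A minimal sketch of this least-squares fit in Python/NumPy. The noisy-sinusoid training data is an assumption chosen to mirror the usual running example; np.linalg.lstsq solves the same minimization as the M + 1 linear equations above.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an M-th order polynomial.

    Builds the N x (M+1) matrix with entries x_n^j and solves the resulting
    linear least-squares problem, which is equivalent to the M + 1 linear
    equations on the slide.
    """
    A = np.vander(x, M + 1, increasing=True)    # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(A, t, rcond=None)   # minimizes ||A w - t||^2
    return w

# Toy data (an assumption): noisy samples of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
w_star = fit_polynomial(x, t, M=3)
```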

Page 9: Machine Learning CS 6830 - webpages.uncc.edu

0th Order Polynomial


Page 10: Machine Learning CS 6830 - webpages.uncc.edu

1st Order Polynomial


Page 11: Machine Learning CS 6830 - webpages.uncc.edu

3rd Order Polynomial


Page 12: Machine Learning CS 6830 - webpages.uncc.edu

9th Order Polynomial


Page 13: Machine Learning CS 6830 - webpages.uncc.edu

Polynomial Curve Fitting

•  Model Selection: choosing the order M of the polynomial.
  –  Best generalization is obtained with M = 3.
  –  M = 9 obtains poor generalization, even though it fits the training examples perfectly:
     •  But M = 9 polynomials subsume M = 3 polynomials!

•  Overfitting ≡ good performance on training examples, poor performance on test examples.


Page 14: Machine Learning CS 6830 - webpages.uncc.edu

Overfitting

•  Measure the fit to the training/testing examples using the Root-Mean-Square (RMS) error:

   E_RMS = √(2 E(w*) / N)

•  Use 100 random test examples, generated in the same way as the training examples.
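A small helper, assuming the polynomial model above, that computes E_RMS exactly as defined:

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N) for a polynomial model with weights w."""
    y = np.vander(x, len(w), increasing=True) @ w   # predictions y(x_n, w)
    E = 0.5 * np.sum((y - t) ** 2)                  # sum-of-squares error E(w)
    return np.sqrt(2.0 * E / len(x))
```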

Page 15: Machine Learning CS 6830 - webpages.uncc.edu

Over-fitting and Parameter Values


Page 16: Machine Learning CS 6830 - webpages.uncc.edu

Overfitting vs. Data Set Size

•  More training data ⇒ less overfitting.
•  What if we do not have more training data?
  –  Use regularization.
  –  Use a probabilistic model in a Bayesian setting.

[Figures: M = 9 fits for different training set sizes]

Page 17: Machine Learning CS 6830 - webpages.uncc.edu

Regularization

•  Penalize large parameter values:

E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 + (λ/2) ‖w‖^2

(the second term is the regularizer)

w* = argmin_w E(w)

Page 18: Machine Learning CS 6830 - webpages.uncc.edu

9th Order Polynomial with Regularization


Page 19: Machine Learning CS 6830 - webpages.uncc.edu

9th Order Polynomial with Regularization


Page 20: Machine Learning CS 6830 - webpages.uncc.edu

Training & Test Error vs. ln λ


How do we find the optimal value of λ?

Page 21: Machine Learning CS 6830 - webpages.uncc.edu

Model Selection

•  Put aside an independent validation set.
•  Select the parameters giving the best performance on the validation set.

[Figure: data set split into validation and training subsets]

ln λ ∈ {−40, −35, −30, −25, −20, −15}

   ln λ       −40     −35     −30     −25     −20     −15
   E_RMS     1.05    0.30    0.25    0.27    0.52    0.55
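A sketch of this validation-based selection. fit_ridge is an assumed helper that returns regularized least-squares weights (a closed-form version is sketched later in these notes), and Phi_train / Phi_val are the design matrices for the two splits.

```python
import numpy as np

def select_lambda(Phi_train, t_train, Phi_val, t_val, ln_lambdas):
    """Return (lambda, E_RMS) with the lowest RMS error on the validation set.

    fit_ridge(Phi, t, lam) is an assumed helper returning the regularized
    least-squares weights (see the closed-form sketch later in these notes).
    """
    best = None
    for ln_lam in ln_lambdas:                          # e.g. [-40, -35, -30, -25, -20, -15]
        lam = np.exp(ln_lam)
        w = fit_ridge(Phi_train, t_train, lam)
        y_val = Phi_val @ w
        e_rms = np.sqrt(np.mean((y_val - t_val) ** 2)) # validation E_RMS
        if best is None or e_rms < best[1]:
            best = (lam, e_rms)
    return best
```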

Page 22: Machine Learning CS 6830 - webpages.uncc.edu

Model Evaluation

•  K-fold cross-validation:
  –  randomly partition the dataset into K equally sized subsets P_1, P_2, …, P_K
  –  for each fold i in {1, 2, …, K}:
     •  test on P_i, train on P_1 ∪ … ∪ P_{i−1} ∪ P_{i+1} ∪ … ∪ P_K
  –  compute the average error/accuracy across the K folds.

[Figure: 4-fold cross-validation]
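A minimal K-fold cross-validation sketch; train_fn and error_fn are assumed callables supplied by the caller (e.g. a ridge fit and the RMS error above).

```python
import numpy as np

def k_fold_cv(x, t, k, train_fn, error_fn, seed=0):
    """Average error over K folds. train_fn(x, t) returns a fitted model;
    error_fn(model, x, t) returns its error on the given examples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # random partition of the data
    folds = np.array_split(idx, k)                 # P_1, ..., P_K (nearly equal sizes)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(x[train_idx], t[train_idx])              # train on the other folds
        errors.append(error_fn(model, x[test_idx], t[test_idx]))  # test on P_i
    return float(np.mean(errors))                  # average error across the K folds
```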

Page 23: Machine Learning CS 6830 - webpages.uncc.edu

Sum-of-Squares Error Function (Revisited)

•  Training objective: minimize the sum-of-squares error.
•  Why least squares?

E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 = (1/2) ∑_{n=1}^{N} {w^T φ(x_n) − t_n}^2

Page 24: Machine Learning CS 6830 - webpages.uncc.edu

Least Squares <=> Maximum Likelihood (1)

•  Assume the observations come from a deterministic function with added Gaussian noise:

   t = y(x, w) + ε,   where ε ~ N(0, β^{−1})

   which is the same as saying:

   p(t | x, w, β) = N(t | y(x, w), β^{−1})

•  Given observed inputs X = {x_1, ..., x_N} and targets t = [t_1, ..., t_N]^T, we obtain the likelihood function:

   p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})

   where β is the precision (inverse variance) of the noise.

Page 25: Machine Learning CS 6830 - webpages.uncc.edu

Least Squares <=> Maximum Likelihood (2)

•  Taking the logarithm, we get the log-likelihood function:

   ln p(t | X, w, β) = −β E_D(w) + (N/2) ln β − (N/2) ln(2π)

   where

   E_D(w) = (1/2) ∑_{n=1}^{N} {t_n − w^T φ(x_n)}^2

•  E_D(w) is the sum-of-squares error!

Page 26: Machine Learning CS 6830 - webpages.uncc.edu

Least Squares <=> Maximum Likelihood (3)

•  Minimizing squared error <=> maximizing likelihood:

•  How do we find w (and β)?

w* = argmin_w E_D(w) = argmax_w ln p(t | w, β) = w_ML

Page 27: Machine Learning CS 6830 - webpages.uncc.edu

Least Squares <=> Maximum Likelihood (4)

•  Computing the gradient and setting it to zero yields:

   0 = ∑_{n=1}^{N} {t_n − w^T φ(x_n)} φ(x_n)^T

•  Solving for w, we get:

   w_ML = (Φ^T Φ)^{−1} Φ^T t = Φ† t

   where Φ is the N × (M+1) design matrix with Φ_{nj} = φ_j(x_n), and Φ† = (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudo-inverse of Φ.
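A one-line version of this solution, using NumPy's pseudo-inverse; the design matrix Phi is assumed to be built from whatever basis functions are in use.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood (least squares) weights w_ML = pinv(Phi) @ t.

    Phi is the N x (M+1) design matrix with Phi[n, j] = phi_j(x_n).
    np.linalg.pinv computes the Moore-Penrose pseudo-inverse, which is more
    numerically stable than forming (Phi^T Phi)^{-1} Phi^T explicitly.
    """
    return np.linalg.pinv(Phi) @ t
```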

Page 28: Machine Learning CS 6830 - webpages.uncc.edu

Least Squares <=> Maximum Likelihood (5)

•  Minimizing squared error <=> maximizing likelihood:

   w* = argmin_w E_D(w) = argmax_w ln p(t | w, β) = w_ML

•  Maximizing with respect to w gives:

   w_ML = (Φ^T Φ)^{−1} Φ^T t

•  Maximizing with respect to β gives:

   1/β_ML = (1/N) ∑_{n=1}^{N} {t_n − w_ML^T φ(x_n)}^2

Page 29: Machine Learning CS 6830 - webpages.uncc.edu

Regularized Least Squares

•  Consider the error function:

   E_D(w) + λ E_W(w)     (data term + regularization term; λ is called the regularization coefficient)

•  With the sum-of-squares error function and a quadratic regularizer, we get:

   E(w) = (1/2) ∑_{n=1}^{N} {t_n − w^T φ(x_n)}^2 + (λ/2) w^T w

   which is minimized by:

   w = (λI + Φ^T Φ)^{−1} Φ^T t
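A sketch of this closed-form regularized solution; this is the fit_ridge helper assumed in the validation-set sketch earlier.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    d = Phi.shape[1]                          # number of basis functions (M + 1)
    A = lam * np.eye(d) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)      # solve the linear system instead of inverting
```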

Page 30: Machine Learning CS 6830 - webpages.uncc.edu

Regularized Least Squares <=> Maximum A Posteriori (MAP)

•  Define a conjugate prior over w:

   p(w | α) = N(w | 0, α^{−1} I)

•  Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior:

   p(w | t) = N(w | m_N, S_N)

   where m_N = β S_N Φ^T t and S_N^{−1} = α I + β Φ^T Φ.

Page 31: Machine Learning CS 6830 - webpages.uncc.edu

Regularized Least Squares <=> Maximum A Posteriori (MAP)

•  Taking the logarithm of the posterior distribution:

   ln p(w | t) = −(β/2) ∑_{n=1}^{N} {t_n − w^T φ(x_n)}^2 − (α/2) w^T w + const

•  The MAP estimate of w is:

   w_MAP = argmax_w ln p(w | t)
         = argmax_w [ −(1/2) ∑_{n=1}^{N} {t_n − w^T φ(x_n)}^2 − (α/(2β)) w^T w ]
         = argmin_w [ (1/2) ∑_{n=1}^{N} {t_n − w^T φ(x_n)}^2 + (λ/2) w^T w ],   with λ = α/β
         = argmin_w [ E_D(w) + λ E_W(w) ]

Page 32: Machine Learning CS 6830 - webpages.uncc.edu

Regularization & Occam’s Razor

William of Occam (1288 – 1348) •  English Franciscan friar, theologian and philosopher.

•  “Entia non sunt multiplicanda praeter necessitatem”

–  Entities must not be multiplied beyond necessity.

i.e. Do not make things needlessly complicated. i.e. Prefer the simplest hypothesis that fits the data.


Page 33: Machine Learning CS 6830 - webpages.uncc.edu

Gradient Descent (Batch)

•  Want to minimize a function f : R^n → R:
  –  f is differentiable and convex.
  –  compute the gradient of f, i.e. the direction of steepest increase:

     ∇f(x) = [ df/dx_1 (x), df/dx_2 (x), …, df/dx_n (x) ]

  –  choose a sequence of points x^(1), x^(2), … and a learning rate η such that:

     x^(τ+1) = x^(τ) − η ∇f(x^(τ))

•  Sum-of-squares error:

   E(w) = (1/2) ∑_{n=1}^{N} {w^T φ(x_n) − t_n}^2
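A minimal batch gradient descent sketch for this error function; the fixed learning rate and iteration count are illustrative choices, and in practice η must be small enough for the iterates to converge.

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.01, num_iters=1000):
    """Minimize E(w) = 0.5 * sum_n (w^T phi(x_n) - t_n)^2 by batch gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(num_iters):
        grad = Phi.T @ (Phi @ w - t)   # gradient of E(w) over the full training set
        w = w - eta * grad             # step against the direction of steepest increase
    return w
```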

Page 34: Machine Learning CS 6830 - webpages.uncc.edu

Gradient Descent


Page 35: Machine Learning CS 6830 - webpages.uncc.edu

Gradient Descent


Page 36: Machine Learning CS 6830 - webpages.uncc.edu

Stochastic Gradient Descent (Online)

•  Decompose the error function into a sum of per-example errors:

   E(w) = (1/2) ∑_{n=1}^{N} {w^T φ(x_n) − t_n}^2 = ∑_{n=1}^{N} E_n(w)

•  Update the parameters w after each example, sequentially:

   w^(τ+1) = w^(τ) − η ∇E_n(w^(τ)) = w^(τ) + η (t_n − (w^(τ))^T φ(x_n)) φ(x_n)

   => the least-mean-square (LMS) algorithm.
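A sketch of the LMS update; shuffling the example order each epoch is an added choice here, not something the slide specifies.

```python
import numpy as np

def lms(Phi, t, eta=0.05, num_epochs=10, seed=0):
    """Stochastic (online) gradient descent on the sum-of-squares error:
    after seeing example n, apply w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(num_epochs):
        for n in rng.permutation(len(t)):    # visit examples in random order each epoch
            phi_n = Phi[n]
            w += eta * (t[n] - w @ phi_n) * phi_n
    return w
```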

Page 37: Machine Learning CS 6830 - webpages.uncc.edu

Regularization: Ridge vs. Lasso

•  Ridge regression:

   E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 + (λ/2) ∑_{j=1}^{M} w_j^2

•  Lasso:

   E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 + (λ/2) ∑_{j=1}^{M} |w_j|

  –  If λ is sufficiently large, some of the coefficients w_j are driven to 0 => a sparse model.
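To see the sparsity effect, here is a small scikit-learn comparison on synthetic data (the data and the specific alpha values are made up; scikit-learn's alpha plays the role of λ up to its own scaling conventions).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data (assumed): 10 features, only the first 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([2.0, -3.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # quadratic (L2) penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # absolute-value (L1) penalty

# Ridge shrinks every coefficient a little but rarely to exactly zero;
# Lasso drives many coefficients to exactly 0, i.e. a sparse model.
print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
```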

Page 38: Machine Learning CS 6830 - webpages.uncc.edu

Polynomial Curve Fitting (Revisited)

y(x) = y(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = ∑_{j=0}^{M} w_j x^j

(the coefficients w_0, …, w_M are the parameters; the powers x^j are the features)

Page 39: Machine Learning CS 6830 - webpages.uncc.edu

Generalization: Basis Functions as Features

•  Generally,

   y(x, w) = ∑_{j=0}^{M} w_j φ_j(x) = w^T φ(x)

   where the φ_j(x) are known as basis functions.

•  Typically φ_0(x) = 1, so that w_0 acts as a bias.

•  In the simplest case, use linear basis functions: φ_d(x) = x_d.

Page 40: Machine Learning CS 6830 - webpages.uncc.edu

Linear Basis Function Models (1)

•  Polynomial basis functions:

   φ_j(x) = x^j

•  Global behavior:
  –  a small change in x affects all basis functions.

Page 41: Machine Learning CS 6830 - webpages.uncc.edu

Linear Basis Function Models (2)

•  Gaussian basis functions:

   φ_j(x) = exp( −(x − µ_j)^2 / (2 s^2) )

•  Local behavior:
  –  a small change in x only affects nearby basis functions.
  –  µ_j and s control location and scale (width).

Page 42: Machine Learning CS 6830 - webpages.uncc.edu

Linear Basis Function Models (3)

•  Sigmoidal basis functions:

   φ_j(x) = σ( (x − µ_j) / s ),   where   σ(a) = 1 / (1 + exp(−a))

•  Local behavior:
  –  a small change in x only affects nearby basis functions.
  –  µ_j and s control location and scale (slope).
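A sketch of these three basis-function families as feature maps; the centers mus and width s below are arbitrary illustrative choices.

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M (global: every phi_j changes with x)."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_basis(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)): local bumps centered at mu_j with width s."""
    return np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2.0 * s ** 2))

def sigmoidal_basis(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))

# Any of these design matrices (optionally with a column of ones prepended for
# the bias phi_0(x) = 1) can be handed to the least-squares / ridge solvers
# sketched earlier; only the feature map phi changes.
x = np.linspace(0.0, 1.0, 50)
Phi = gaussian_basis(x, mus=np.linspace(0.0, 1.0, 9), s=0.1)
```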