
Advanced Introduction to Machine Learning

10715, Fall 2014

Linear Regression and Sparsity

Eric Xing
Lecture 2, September 10, 2014



Machine learning for apartment hunting

Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs:

square-ft., # of bedroom, distance to campus …

Living area (ft²)   # bedrooms   Rent ($)
230                 1            600
506                 2            1000
433                 2            1100
109                 1            500
…
150                 1            ?
270                 1.5          ?


The learning problem

Features: living area, distance to campus, # bedrooms, … Denote these as x = [x1, x2, …, xk]

Target: rent. Denote it as y

Training set:

[Figure: scatter plot of rent vs. living area, and a plot of rent vs. living area and location]

$$
X =
\begin{bmatrix}
\mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots &        & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{bmatrix}
\qquad \text{and} \qquad
\mathbf{y} =
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{bmatrix}
$$


Linear Regression

Assume that Y (target) is a linear function of X (features), e.g.:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Let's assume a vacuous "feature" x0 = 1 (this is the intercept term, why?), and define the feature vector to be x = [x0, x1, …, xk]; then we have the following general representation of the linear function:

$$\hat{y}(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x} = \sum_{j=0}^{k} \theta_j x_j$$

Our goal is to pick the optimal θ. How? We seek the θ that minimizes the following cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(\hat{y}(\mathbf{x}_i) - y_i\right)^2$$


The Least-Mean-Square (LMS) method

The Cost Function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(\mathbf{x}_i^T \boldsymbol{\theta} - y_i\right)^2$$

Consider a gradient descent algorithm:

$$\theta_j^{t+1} = \theta_j^{t} - \alpha\,\frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta}^{t})$$


The Least-Mean-Square (LMS) method

Now we have the following descent rule:

$$\theta_j^{t+1} = \theta_j^{t} + \alpha \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) x_{ij}$$

For a single training point, we have:

$$\theta_j^{t+1} = \theta_j^{t} + \alpha \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) x_{ij}$$

This is known as the LMS update rule, or the Widrow-Hoff learning rule. This is actually a "stochastic", "coordinate" descent algorithm. It can be used as an on-line algorithm.
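As a concrete illustration, here is a minimal sketch of the single-point (stochastic) LMS update in Python/NumPy; the toy data, step size, and epoch count are hypothetical choices, not values from the slides.

```python
import numpy as np

def lms_sgd(X, y, alpha=0.01, epochs=50):
    """Stochastic (per-example) LMS / Widrow-Hoff updates for linear regression.

    X is assumed to already contain the intercept column x0 = 1.
    """
    n, k = X.shape
    theta = np.zeros(k)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            error = y[i] - X[i] @ theta        # y_i - x_i^T theta
            theta += alpha * error * X[i]      # theta_j += alpha * error * x_ij
    return theta

# Toy usage: y ~ 1 + 2*x with noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)
print(lms_sgd(X, y))   # roughly [1, 2]
```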


Geometry and Convergence of LMS

[Figure: contour plots of the LMS trajectory after N = 1, 2, and 3 updates]

Claim: when the step size satisfies a certain condition, and when certain other technical conditions are satisfied, LMS will converge to an "optimal region".

$$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \alpha \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) \mathbf{x}_i$$


Steepest Descent and LMS

Steepest descent. Note that:

$$\nabla_{\theta} J = \left[\frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_k}\right]^T = -\sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\right) \mathbf{x}_i$$

$$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \alpha \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) \mathbf{x}_i$$

This is a batch gradient descent algorithm.
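For comparison with the stochastic rule above, a minimal batch steepest-descent sketch (step size and iteration count are hypothetical; the step size must be small enough for convergence):

```python
import numpy as np

def lms_batch(X, y, alpha=0.01, epochs=500):
    """Batch steepest descent on J(theta) = 0.5 * sum_i (x_i^T theta - y_i)^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = -X.T @ (y - X @ theta)   # gradient of J at the current theta
        theta -= alpha * grad           # i.e., theta += alpha * sum_i (y_i - x_i^T theta) x_i
    return theta
```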


The normal equations

Write the cost function in matrix form:

$$
J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\boldsymbol{\theta} - y_i\right)^2
          = \frac{1}{2}\left(X\boldsymbol{\theta} - \mathbf{y}\right)^T\left(X\boldsymbol{\theta} - \mathbf{y}\right)
          = \frac{1}{2}\left(\boldsymbol{\theta}^T X^T X \boldsymbol{\theta} - \boldsymbol{\theta}^T X^T \mathbf{y} - \mathbf{y}^T X \boldsymbol{\theta} + \mathbf{y}^T \mathbf{y}\right)
$$

where

$$
X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix},
\qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$

To minimize J(θ), take the derivative and set it to zero:

$$
\nabla_{\theta} J
= \frac{1}{2}\nabla_{\theta}\,\mathrm{tr}\left(\boldsymbol{\theta}^T X^T X \boldsymbol{\theta} - \boldsymbol{\theta}^T X^T \mathbf{y} - \mathbf{y}^T X \boldsymbol{\theta} + \mathbf{y}^T \mathbf{y}\right)
= \frac{1}{2}\left(X^T X \boldsymbol{\theta} + X^T X \boldsymbol{\theta} - 2 X^T \mathbf{y}\right)
= X^T X \boldsymbol{\theta} - X^T \mathbf{y} = 0
$$

which gives the normal equations

$$X^T X \boldsymbol{\theta} = X^T \mathbf{y}$$

and hence

$$\boldsymbol{\theta}^{*} = \left(X^T X\right)^{-1} X^T \mathbf{y}$$
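A minimal sketch of solving the normal equations numerically, using a linear solve rather than an explicit inverse (the usual numerically safer choice); the variables are the X and y defined above.

```python
import numpy as np

def normal_equations(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X has full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
# theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```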


Some matrix derivatives

For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define:

$$
\nabla_A f(A) =
\begin{bmatrix}
\frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}}
\end{bmatrix}
$$

Trace:

$$\mathrm{tr}\,A = \sum_{i=1}^{n} A_{ii}, \qquad \mathrm{tr}\,a = a, \qquad \mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA$$

Some facts about matrix derivatives (without proof):

$$\nabla_A \,\mathrm{tr}\,AB = B^T, \qquad \nabla_{A^T} f(A) = \left(\nabla_A f(A)\right)^T, \qquad \nabla_A \,\mathrm{tr}\,ABA^TC = CAB + C^TAB^T, \qquad \nabla_A |A| = |A|\left(A^{-1}\right)^T$$


Comments on the normal equation

In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that XᵀX is necessarily invertible.

The assumption that XᵀX is invertible implies that it is positive definite, and thus the critical point we have found is a minimum.

What if X has less than full column rank? Regularization (more on this later).


Direct and Iterative methods

Direct methods: we can achieve the solution in a single step by solving the normal equations. Using Gaussian elimination or QR decomposition, we converge in a finite number of steps. This can be infeasible when data arrive as a real-time stream, or when the dataset is very large.

Iterative methods: stochastic or steepest gradient descent. They converge in a limiting sense, but are more attractive for large practical problems. Caution is needed when choosing the learning rate.


Convergence rate

Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations,

$$\boldsymbol{\theta}^{*} = \left(X^T X\right)^{-1} X^T \mathbf{y},$$

if the step size α is sufficiently small.

A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.


A Summary: LMS update rule

Pros: on-line, low per-step cost, fast convergence and perhaps less prone to local optimum

Cons: convergence to optimum not always guaranteed

Steepest descent

Pros: easy to implement, conceptually clean, guaranteed convergence Cons: batch, often slow converging

Normal equations

Pros: a single-shot algorithm! Easiest to implement. Cons: need to compute the pseudo-inverse via (XᵀX)⁻¹, which is expensive and raises numerical issues (e.g., the matrix may be singular), although there are ways to get around this …

LMS update rule:  $\theta_j^{t+1} = \theta_j^{t} + \alpha \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) x_{ij}$

Steepest descent:  $\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \alpha \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^T \boldsymbol{\theta}^{t}\right) \mathbf{x}_i$

Normal equations:  $\boldsymbol{\theta}^{*} = \left(X^T X\right)^{-1} X^T \mathbf{y}$


Geometric Interpretation of LMS

The predictions on the training data are:

$$\hat{\mathbf{y}} = X\boldsymbol{\theta}^{*} = X\left(X^T X\right)^{-1} X^T \mathbf{y}$$

Note that

$$\hat{\mathbf{y}} - \mathbf{y} = \left(X\left(X^T X\right)^{-1} X^T - I\right) \mathbf{y}$$

and

$$X^T\left(\hat{\mathbf{y}} - \mathbf{y}\right) = X^T\left(X\left(X^T X\right)^{-1} X^T - I\right)\mathbf{y} = \left(X^T X\left(X^T X\right)^{-1} X^T - X^T\right)\mathbf{y} = 0 \;!!$$

so $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ into the space spanned by the columns of $X$.
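A quick numerical check of this projection property, with a small random design matrix standing in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))          # 20 points, 3 features (full column rank)
y = rng.standard_normal(20)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star                    # orthogonal projection of y onto col(X)

print(np.allclose(X.T @ (y_hat - y), 0))  # True: the residual is orthogonal to the columns of X
```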


Probabilistic Interpretation of LMS

Let us assume that the target variable and the inputs are related by the equation:

$$y_i = \boldsymbol{\theta}^T \mathbf{x}_i + \varepsilon_i$$

where ε is an error term of unmodeled effects or random noise.

Now assume that ε follows a Gaussian N(0, σ); then we have:

$$p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y_i - \boldsymbol{\theta}^T \mathbf{x}_i\right)^2}{2\sigma^2}\right)$$

By the independence assumption:

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\!\left(-\frac{\sum_{i=1}^{n}\left(y_i - \boldsymbol{\theta}^T \mathbf{x}_i\right)^2}{2\sigma^2}\right)$$


Probabilistic Interpretation of LMS, cont.

Hence the log-likelihood is:

$$\ell(\boldsymbol{\theta}) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y_i - \boldsymbol{\theta}^T \mathbf{x}_i\right)^2$$

Do you recognize the last term?

Yes, it is:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\boldsymbol{\theta} - y_i\right)^2$$

Thus, under the independence assumption, LMS is equivalent to MLE of θ!
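A small numerical check of this equivalence (toy data, and σ = 1 assumed for simplicity): the negative log-likelihood and the least-squares cost differ only by a constant in θ, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, -2.0]) + 0.3 * rng.standard_normal(50)

def J(theta):
    return 0.5 * np.sum((X @ theta - y) ** 2)

def neg_log_lik(theta, sigma=1.0):
    resid = y - X @ theta
    return len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + 0.5 * np.sum(resid ** 2) / sigma**2

t1, t2 = np.array([0.0, 0.0]), np.array([1.0, -2.0])
print(np.isclose(neg_log_lik(t1) - neg_log_lik(t2), J(t1) - J(t2)))  # True
```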


Case study: predicting gene expression

The genetic picture

[Figure: a DNA sequence (CGTTTCACTGTACAATTT) with causal SNPs marked]

a univariate phenotype:

i.e., the expression intensity of a gene


Association Mapping as Regression

Phenotype (BMI):   Individual 1: 2.5    Individual 2: 4.8    …    Individual N: 4.7

Genotype (two haplotypes per individual):
. . C . . . . . T . . C . . . . . . . T . . .
. . C . . . . . A . . C . . . . . . . T . . .
. . G . . . . . A . . G . . . . . . . A . . .
. . C . . . . . T . . C . . . . . . . T . . .
. . G . . . . . T . . C . . . . . . . T . . .
. . G . . . . . T . . G . . . . . . . T . . .

Some SNP positions are causal; the remaining SNPs are benign.


Association Mapping as Regression

Individual     Genotype (coded)                                 Phenotype (BMI)
Individual 1   . . 0 . . . . . 1 . . 0 . . . . . . . 0 . . .    2.5
Individual 2   . . 1 . . . . . 1 . . 1 . . . . . . . 1 . . .    4.8
…
Individual N   . . 2 . . . . . 2 . . 1 . . . . . . . 0 . . .    4.7

$$y_i = \sum_{j=1}^{J} \beta_j x_{ij}$$

SNPs with large |β_j| are relevant.


Experimental setup

Asthma dataset
543 individuals, genotyped at 34 SNPs.
Diploid data was transformed into 0/1 (for homozygotes) or 2 (for heterozygotes).
X = 543 × 34 matrix; Y = phenotype variable (continuous).
A single phenotype was used for regression.

Implementation details
Iterative methods: batch update and online update implemented.
For both methods, the step size α is chosen to be a small fixed value (10⁻⁶). This choice is based on the data used for the experiments.
Both methods are run only to a maximum of 2000 epochs, or until the change in training MSE is less than 10⁻⁴.


Convergence Curves

For the batch method, the training MSE is initially large due to uninformed initialization

In the online update, the N updates per epoch reduce the MSE to a much smaller value.


The Learned Coefficients


Multivariate Regression for Trait Association Analysis

[Figure: a trait value (y = 2.1) regressed on a genotype vector x; each SNP receives an association strength β]

$$\mathbf{y} = X\boldsymbol{\beta}$$


Multivariate Regression for Trait Association Analysis

[Figure: the same genotype-to-trait regression; the estimated association strengths are dense]

Many non-zero associations: which SNPs are truly significant?


Sparsity

One common assumption to make: sparsity.

Makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.

Makes statistical sense: Learning is now feasible in high dimensions with small sample size


Sparsity: In a mathematical sense

Consider the least squares linear regression problem; sparsity means most of the β's are zero.

But this is not convex!!! Many local optima; computationally intractable.

[Figure: y modeled as a weighted combination of x1, x2, …, xn with weights β1, β2, …, βn, most of which are zero]


L1 Regularization (LASSO) (Tibshirani, 1996)

A convex relaxation.

Still enforces sparsity!

Constrained form:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2}\left\|\mathbf{y} - X\boldsymbol{\beta}\right\|_2^2 \quad \text{subject to} \quad \left\|\boldsymbol{\beta}\right\|_1 \le t$$

Lagrangian form:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2}\left\|\mathbf{y} - X\boldsymbol{\beta}\right\|_2^2 + \lambda\left\|\boldsymbol{\beta}\right\|_1$$
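One standard way to solve the Lagrangian form (not necessarily the solver used in the course) is coordinate descent with soft-thresholding. A minimal sketch, assuming a fixed, hypothetical regularization strength `lam`:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                 # precompute ||x_j||^2
    for _ in range(n_iters):
        for j in range(p):
            # partial residual excluding feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
        # (a convergence check on the change in beta could be added here)
    return beta
```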


Theoretical Guarantees

Assumptions:
Dependency condition: relevant covariates are not overly dependent.
Incoherence condition: a large number of irrelevant covariates cannot be too correlated with the relevant covariates.
Strong concentration bounds: sample quantities converge to their expected values quickly.

If these assumptions are met, LASSO will asymptotically recover the correct subset of relevant covariates.


Consistent Structure Recovery [Zhao and Yu 2006]


Lasso for Reducing False Positives

[Figure: the genotype-to-trait regression (y = 2.1) with a lasso penalty added; the estimated association strengths are now sparse]

$$\min_{\boldsymbol{\beta}}\; \left\|\mathbf{y} - X\boldsymbol{\beta}\right\|_2^2 + \lambda \sum_{j=1}^{J} |\beta_j| \qquad \text{(lasso penalty for sparsity)}$$

Many zero associations (sparse results), but what if there are multiple related traits?


Ridge Regression vs Lasso

Ridge regression: $\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_2^2$.  Lasso: $\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_1$ (HOT!)

Lasso (l1 penalty) results in sparse solutions, i.e., vectors with more zero coordinates. Good for high-dimensional problems: we don't have to store all coordinates!

[Figure: in the (β1, β2) plane, the level sets of J(β) meet the set of βs with constant l1 norm (a diamond) at a corner, giving a sparse solution, whereas they meet the set of βs with constant l2 norm (a circle) off-axis]


Bayesian Interpretation

Treat the distribution parameters θ also as a random variable. The a posteriori distribution of θ after seeing the data is:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

This is Bayes' rule:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$

The prior p(·) encodes our prior knowledge about the domain.

Regularized Least Squares and MAP

What if (XᵀX) is not invertible?

$$\hat{\boldsymbol{\beta}}_{MAP} = \arg\max_{\boldsymbol{\beta}} \underbrace{\log p(\mathbf{y} \mid X, \boldsymbol{\beta})}_{\text{log likelihood}} + \underbrace{\log p(\boldsymbol{\beta})}_{\text{log prior}}$$

I) Gaussian prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0},\, \tau^2 I)$

Prior belief that β is Gaussian with zero mean biases the solution to "small" β.

This gives Ridge Regression. Closed form: HW.
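The closed form is left as homework in the slides; a standard sketch of ridge regression under these assumptions (with λ a hypothetical regularization strength tied to the prior variance) looks like:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Ridge regression: beta = (X^T X + lam*I)^{-1} X^T y.

    Adding lam*I makes the system solvable even when X^T X is singular
    (e.g., when there are more features than data points).
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```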


Regularized Least Squares and MAP

What if (XᵀX) is not invertible?

$$\hat{\boldsymbol{\beta}}_{MAP} = \arg\max_{\boldsymbol{\beta}} \underbrace{\log p(\mathbf{y} \mid X, \boldsymbol{\beta})}_{\text{log likelihood}} + \underbrace{\log p(\boldsymbol{\beta})}_{\text{log prior}}$$

II) Laplace prior

Prior belief that β is Laplace with zero mean biases the solution to "small" β.

This gives the Lasso. Closed form: HW.


Take home message Gradient descent

On-line Batch

Normal equations Geometric interpretation of LMS Probabilistic interpretation of LMS, and equivalence of LMS and

MLE under certain assumption (what?) Sparsity:

Approach: ridge vs. lasso regression Interpretation: regularized regression versus Bayesian regression Algorithm: convex optimization (we did not discuss this)

LR does not mean fitting linear relations, but a linear combination of basis functions (which can be non-linear)

Weighting points by importance versus by fitness


After class material …


Advanced Material: Beyond basic LR LR with non-linear basis functions

Locally weighted linear regression

Regression trees and Multilinear Interpolation

We will discuss this in the next class after we set the stage right! (If we've got time.)


LR with non-linear basis functions

LR does not mean we can only deal with linear relationships.

We are free to design (non-linear) features under LR:

$$y(\mathbf{x}) = \theta_0 + \sum_{j=1}^{m} \theta_j \phi_j(\mathbf{x}) = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x})$$

where the φⱼ(x) are fixed basis functions (and we define φ₀(x) = 1).

Example: polynomial regression:

$$\boldsymbol{\phi}(x) := \left[1,\, x,\, x^2,\, x^3\right]$$

We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.
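A minimal sketch of polynomial regression via an explicit basis expansion followed by least squares; the degree and the toy data are hypothetical.

```python
import numpy as np

def poly_features(x, degree=3):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Toy usage: fit a cubic to noisy samples of sin(x)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

Phi = poly_features(x, degree=3)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least squares in the expanded basis
print(theta)
```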


Basis functions

There are many basis functions, e.g.:

Polynomial: $\phi_j(x) = x^{\,j-1}$

Radial basis functions: $\phi_j(x) = \exp\!\left(-\dfrac{(x - \mu_j)^2}{2 s^2}\right)$

Sigmoidal: $\phi_j(x) = \sigma\!\left(\dfrac{x - \mu_j}{s}\right)$

Splines, Fourier, Wavelets, etc.
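For instance, a minimal sketch of Gaussian RBF features (the centers and width below are hypothetical choices); the resulting feature matrix can be plugged into the same least-squares machinery as before.

```python
import numpy as np

def rbf_features(x, centers, s=1.0):
    """Gaussian RBF basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))
    return np.column_stack([np.ones(len(x)), Phi])

# Hypothetical usage: 5 centers spread over the input range
x = np.linspace(0.0, 10.0, 50)
centers = np.linspace(0.0, 10.0, 5)
Phi = rbf_features(x, centers, s=2.0)   # shape (50, 6)
```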


1D and 2D RBFs

1D RBF

After fit:


Good and Bad RBFs

A good 2D RBF

Two bad 2D RBFs


Overfitting and underfitting

$$y = \theta_0 + \theta_1 x \qquad\qquad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad\qquad y = \sum_{j=0}^{5} \theta_j x^j$$


Bias and variance We define the bias of a model to be the expected

generalization error even if we were to fit it to a very (say, infinitely) large training set.

By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.


Locally weighted linear regression

The algorithm: instead of minimizing

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\boldsymbol{\theta} - y_i\right)^2$$

we now fit θ to minimize

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} w_i \left(\mathbf{x}_i^T\boldsymbol{\theta} - y_i\right)^2$$

Where do the wi's come from?

$$w_i = \exp\!\left(-\frac{(\mathbf{x}_i - \mathbf{x})^2}{2\tau^2}\right)$$

where x is the query point for which we'd like to know its corresponding y.

Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query).
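A minimal sketch of one locally weighted prediction at a query point, solving the weighted least squares problem above (the bandwidth `tau` is a hypothetical choice):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict y at x_query with locally weighted linear regression.

    Solves min_theta 0.5 * sum_i w_i (x_i^T theta - y_i)^2 with Gaussian weights
    centered at the query point, then returns x_query^T theta.
    """
    # Gaussian weights based on distance to the query point
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))
    # Weighted normal equations: (X^T W X) theta = X^T W y
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta
```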


Parametric vs. non-parametric Locally weighted linear regression is the second example we

are running into of a non-parametric algorithm. (what is the first?)

The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm because it has a fixed, finite number of parameters (the θ), which are fit to the

data; Once we've fit the θ and stored them away, we no longer need to keep the

training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need

to keep the entire training set around.

The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.


Robust Regression

The best fit from a quadratic regression

But this is probably better …

How can we do this?

LOESS-based Robust Regression

Remember what we do in "locally weighted linear regression"? We "score" each point for its importance.

Now we score each point according to its "fitness"

(Courtesy of Andrew Moore)

Robust regression

For k = 1 to R…
  Let (x_k, y_k) be the kth datapoint.
  Let y_k^est be the predicted value of y_k.
  Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly:

$$w_k = \mathrm{KernelFn}\!\left[\left(y_k - y_k^{est}\right)^2\right]$$

Then redo the regression using weighted data points.

Repeat the whole thing until converged!
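A minimal sketch of this iterative reweighting loop, using a Gaussian kernel of the squared residual as one hypothetical choice of weighting function (the slides do not pin down the kernel):

```python
import numpy as np

def robust_fit(X, y, n_rounds=10, scale=1.0):
    """Iteratively reweighted least squares: downweight points with large residuals."""
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        XtW = X.T * w
        theta = np.linalg.solve(XtW @ X, XtW @ y)      # weighted least squares fit
        resid = y - X @ theta
        w = np.exp(-resid ** 2 / (2.0 * scale ** 2))   # w_k shrinks as the residual grows
    return theta
```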


Robust regression—probabilistic interpretation

What regular regression does:

Assume y_k was originally generated using the following recipe:

$$y_k = \boldsymbol{\theta}^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)$$

The computational task is to find the Maximum Likelihood estimate of θ.


Robust regression—probabilistic interpretation

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

with probability p:

$$y_k = \boldsymbol{\theta}^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)$$

but otherwise:

$$y_k \sim \mathcal{N}(\mu, \sigma_{huge}^2)$$

The computational task is to find the Maximum Likelihood estimates of θ, p, µ, and σ_huge.

The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.


Regression Tree

Decision tree for regression:

Gender   Rich?   Num. Children   # travel per yr.   Age
F        No      2               5                  38
M        No      0               2                  25
M        Yes     1               0                  72
:        :       :               :                  :

[Figure: a one-split regression tree on Gender; Female → predicted age = 39, Male → predicted age = 36]


A conceptual picture

Assuming regular regression trees, can you sketch a graph of the fitted function y*(x) over this diagram?


How about this one? Multilinear Interpolation

We wanted to create a continuous and piecewise linear fit to the data


Take home message Gradient descent

On-line Batch

Normal equations Geometric interpretation of LMS Probabilistic interpretation of LMS, and equivalence of LMS and

MLE under certain assumption (what?) Sparsity:

Approach: ridge vs. lasso regression Interpretation: regularized regression versus Bayesian regression Algorithm: convex optimization (we did not discuss this)

LR does not mean fitting linear relations, but a linear combination of basis functions (which can be non-linear)

Weighting points by importance versus by fitness


Appendix


Parameter Learning from iid Data

Goal: estimate distribution parameters θ from a dataset of N

independent, identically distributed (iid), fully observed, training cases

D = {x1, . . . , xN}

Maximum likelihood estimation (MLE)

1. One of the most common estimators.
2. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data:

$$L(\theta) = P(x_1, x_2, \ldots, x_N;\, \theta) = P(x_1;\theta)\, P(x_2;\theta) \cdots P(x_N;\theta) = \prod_{i=1}^{N} P(x_i;\theta)$$

3. Pick the setting of parameters most likely to have generated the data we saw:

$$\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log L(\theta)$$

Example: Bernoulli model

Data:
We observed N iid coin tosses: D = {1, 0, 1, …, 0}

Representation: binary r.v. $x_n \in \{0, 1\}$

Model:

$$P(x) = \begin{cases} 1-\theta & \text{for } x = 0 \\ \theta & \text{for } x = 1 \end{cases}
\qquad\Longleftrightarrow\qquad
P(x) = \theta^{x}\,(1-\theta)^{1-x}$$

How to write the likelihood of a single observation x_i?

$$P(x_i \mid \theta) = \theta^{x_i}\,(1-\theta)^{1-x_i}$$

The likelihood of the dataset D = {x_1, …, x_N}:

$$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \theta^{\sum_i x_i}\,(1-\theta)^{N - \sum_i x_i} = \theta^{\#\text{heads}}\,(1-\theta)^{\#\text{tails}}$$


Maximum Likelihood Estimation

Objective function:

$$\ell(\theta; D) = \log P(D \mid \theta) = \log\left(\theta^{n_h}(1-\theta)^{n_t}\right) = n_h \log\theta + (N - n_h)\log(1-\theta)$$

We need to maximize this w.r.t. θ.

Take the derivative w.r.t. θ and set it to zero:

$$\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N - n_h}{1 - \theta} = 0
\qquad\Longrightarrow\qquad
\hat{\theta}_{MLE} = \frac{n_h}{N}, \quad\text{or}\quad \hat{\theta}_{MLE} = \frac{1}{N}\sum_{i} x_i$$

Frequency as sample mean.

Sufficient statistics: the counts, $n_h = \sum_i x_i$ (and $n_t = N - n_h$), are sufficient statistics of the data D.
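A tiny numerical sketch of the Bernoulli MLE (the tosses below are hypothetical), together with a pseudo-count-smoothed estimate in the spirit of the smoothing discussed on the next slide:

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])    # hypothetical coin tosses

theta_mle = D.mean()                             # n_h / N
n_prime = 1                                      # hypothetical pseudo-count
theta_smooth = (D.sum() + n_prime) / (len(D) + n_prime)
print(theta_mle, theta_smooth)
```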


Overfitting

Recall that for the Bernoulli distribution, we have

$$\hat{\theta}_{ML}^{\,head} = \frac{n_{head}}{n_{head} + n_{tail}}$$

What if we tossed too few times, so that we saw zero heads? We have $\hat{\theta}_{ML}^{\,head} = 0$, and we will predict that the probability of seeing a head next is zero!!!

The rescue: "smoothing"

$$\hat{\theta}_{ML}^{\,head} = \frac{n_{head} + n'}{n_{head} + n_{tail} + n'}$$

where n' is known as the pseudo- (imaginary) count.

But can we make this more formal?


Bayesian Parameter Estimation

Treat the distribution parameters θ also as a random variable. The a posteriori distribution of θ after seeing the data is:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

This is Bayes' rule:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$

The prior p(·) encodes our prior knowledge about the domain.

Frequentist Parameter Estimation

Two people with different priors p(θ) will end up with different estimates p(θ|D).

Frequentists dislike this “subjectivity”. Frequentists think of the parameter as a fixed, unknown

constant, not a random variable. Hence they have to come up with different "objective" estimators (ways of computing θ̂ from the data), instead of using Bayes' rule. These estimators have different properties, such as being "unbiased", "minimum variance", etc. The maximum likelihood estimator is one such estimator.


Discussion

θ or p(θ), this is the problem!

Bayesians know it


Bayesian estimation for Bernoulli

Beta distribution:

$$P(\theta;\, \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1} = B(\alpha, \beta)\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

When x is discrete:

$$\Gamma(x+1) = x\,\Gamma(x) = x!$$

Posterior distribution of θ:

$$p(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)}
\;\propto\; \theta^{n_h}(1-\theta)^{n_t} \times \theta^{\alpha-1}(1-\theta)^{\beta-1}
\;=\; \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}$$

Notice the isomorphism of the posterior to the prior: such a prior is called a conjugate prior. α and β are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts).


Bayesian estimation for Bernoulli, con'd

Posterior distribution of θ:

$$p(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)}
\;\propto\; \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}$$

Maximum a posteriori (MAP) estimation:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \log P(\theta \mid x_1, \ldots, x_N)$$

Posterior mean estimation:

$$\hat{\theta}_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C\int \theta\, \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$$

Beta parameters can be understood as pseudo-counts.

Prior strength: A = α + β. A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts.


Effect of Prior Strength

Suppose we have a uniform prior (α/A = β/A = 1/2), and we observe $\vec{n} = (n_h = 2,\, n_t = 8)$.

Weak prior, A = 2. Posterior prediction:

$$p(x = h \mid n_h = 2, n_t = 8,\, \alpha' = 1, \beta' = 1) = \frac{2 + 1}{10 + 2} = 0.25$$

Strong prior, A = 20. Posterior prediction:

$$p(x = h \mid n_h = 2, n_t = 8,\, \alpha' = 10, \beta' = 10) = \frac{2 + 10}{10 + 20} = 0.40$$

However, if we have enough data, it washes away the prior, e.g., $\vec{n} = (n_h = 200,\, n_t = 800)$. Then the estimates under the weak and strong priors are $\frac{200+1}{1000+2}$ and $\frac{200+10}{1000+20}$, respectively, both of which are close to 0.2.


Example 2: Gaussian density

Data:
We observed N iid real samples: D = {-0.1, 10, 1, -5.2, …, 3}

Model:

$$P(x) = \left(2\pi\sigma^2\right)^{-1/2} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Log likelihood:

$$\ell(\theta; D) = \log P(D \mid \theta) = -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(x_n - \mu\right)^2$$

MLE: take the derivatives and set them to zero:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_n\left(x_n - \mu\right) = 0,
\qquad
\frac{\partial \ell}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_n\left(x_n - \mu\right)^2 = 0$$

$$\Longrightarrow\qquad
\mu_{MLE} = \frac{1}{N}\sum_n x_n,
\qquad
\sigma^2_{MLE} = \frac{1}{N}\sum_n\left(x_n - \mu_{ML}\right)^2$$


MLE for a multivariate Gaussian

It can be shown that the MLE for µ and Σ is

$$\mu_{MLE} = \frac{1}{N}\sum_n \mathbf{x}_n,
\qquad
\Sigma_{MLE} = \frac{1}{N}\sum_n \left(\mathbf{x}_n - \mu_{ML}\right)\left(\mathbf{x}_n - \mu_{ML}\right)^T = \frac{1}{N} S$$

where the scatter matrix is

$$S = \sum_n \left(\mathbf{x}_n - \mu_{ML}\right)\left(\mathbf{x}_n - \mu_{ML}\right)^T = \sum_n \mathbf{x}_n \mathbf{x}_n^T - N\,\mu_{ML}\,\mu_{ML}^T$$

with

$$\mathbf{x}_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nK} \end{bmatrix},
\qquad
X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}$$

The sufficient statistics are $\sum_n \mathbf{x}_n$ and $\sum_n \mathbf{x}_n \mathbf{x}_n^T$.

Note that $X^T X = \sum_n \mathbf{x}_n \mathbf{x}_n^T$ may not be full rank (e.g., if N < D), in which case $\Sigma_{ML}$ is not invertible.


Bayesian estimation

Normal prior:

$$P(\mu) = \left(2\pi\sigma_0^2\right)^{-1/2} \exp\!\left\{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right\}$$

Joint probability:

$$P(\mathbf{x}, \mu) = \left(2\pi\sigma^2\right)^{-N/2} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(x_n - \mu\right)^2\right\}
\times \left(2\pi\sigma_0^2\right)^{-1/2} \exp\!\left\{-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right\}$$

Posterior:

$$P(\mu \mid \mathbf{x}) = \left(2\pi\tilde{\sigma}^2\right)^{-1/2} \exp\!\left\{-\frac{(\mu - \tilde{\mu})^2}{2\tilde{\sigma}^2}\right\}$$

where

$$\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad
\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1},
\qquad
\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n \;\;\text{(sample mean)}$$


Bayesian estimation: unknown µ, known σ

The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels:

$$\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad
\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$$

The precision of the posterior, 1/σ²_N, is the precision of the prior, 1/σ²₀, plus one contribution of data precision, 1/σ², for each observed data point.

[Figure: sequentially updating the mean; µ* = 0.8 (unknown), (σ²)* = 0.1 (known)]

Effect of a single data point (N = 1):

$$\tilde{\mu} = \mu_0 + \left(x - \mu_0\right)\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}$$

Uninformative (vague/flat) prior, σ²₀ → ∞: the posterior mean reduces to the sample mean x̄.
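A minimal sketch of this posterior update for µ with known σ², using hypothetical prior and data values chosen in the spirit of the slide's sequential-update figure:

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, sigma2_0):
    """Posterior N(mu_tilde, sigma2_tilde) for the mean of a Gaussian with known variance."""
    N = len(x)
    prec = N / sigma2 + 1.0 / sigma2_0              # posterior precision
    mu_tilde = (N / sigma2 * np.mean(x) + mu0 / sigma2_0) / prec
    return mu_tilde, 1.0 / prec

# Hypothetical usage: true mean 0.8 (unknown), known variance 0.1
rng = np.random.default_rng(0)
x = 0.8 + np.sqrt(0.1) * rng.standard_normal(20)
print(posterior_mu(x, sigma2=0.1, mu0=0.0, sigma2_0=1.0))
```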
