
Linear Regression


Slide 1: Linear Regression

Oliver Schulte, Machine Learning 726

Slide 2: The Linear Regression Model

Slide 3: Parameter Learning Scenarios
The general problem: predict the value of a continuous variable from one or more continuous features.

Child node \ Parent node | Discrete parent                      | Continuous parent
Discrete child           | maximum likelihood; decision trees   | logit distribution (logistic regression)
Continuous child         | conditional Gaussian (not discussed) | linear Gaussian (linear regression)

Slide 4: Example 1: Predict House Prices
[Figure (Russell and Norvig): price vs. floor space of houses for sale in Berkeley, CA, July 2009.]

[Scatter plot; x-axis: Size, y-axis: Price.]
Notes: The line shows the best fit; this is a univariate example.

Slide 5: Grading Example
Predict: the final percentage mark for a student.
Features: assignment grades, midterm exam, final exam.
Questions we could ask:
- I forgot the weights of the components. Can you recover them from a spreadsheet of the final grades?
- I lost the final exam grades. How well can I still predict the final mark?
- How important is each component, actually? Could I guess someone's final mark well given their assignments? Given their exams?

Notes: See examples/regression.

Slide 6: Line Fitting
Input: a data table X (N x D) and a target vector t (N x 1).
Output: a weight vector w (D x 1).
Prediction model: the predicted value is a weighted linear combination of the input features, y_n = w^T x_n, or in matrix form y = Xw.
Notes: What is N? What is D?
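A minimal sketch of this prediction model in NumPy; the data table, targets, and weights below are made up for illustration and are not from the slides:

```python
import numpy as np

# Toy data table: N = 4 students, D = 3 features (assignments, midterm, final).
X = np.array([[0.8, 0.70, 0.90],
              [0.6, 0.50, 0.40],
              [0.9, 0.95, 0.85],
              [0.5, 0.60, 0.70]])          # shape (N, D)
w = np.array([0.3, 0.3, 0.4])              # shape (D,): hypothetical component weights
t = np.array([0.82, 0.50, 0.90, 0.61])     # shape (N,): observed final marks

y_hat = X @ w                              # predicted value = weighted linear combination
print(y_hat)
```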

Slide 7: Least Squares Error

We seek the closest fit of the predicted line to the data points.
Error = the sum of squared errors: E(w) = (1/2) sum_n (w^T x_n - t_n)^2 (the factor 1/2 is for convenience and does not change the minimizer).
Sensitive to outliers.
Notes: Sorry about the notation change, but it's good for you. Demonstrate in spreadsheet.
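A matching sketch of the squared-error computation for a candidate weight vector (toy numbers, not from the slides):

```python
import numpy as np

X = np.array([[1.0, 0.8], [1.0, 0.6], [1.0, 0.9]])  # toy inputs with a dummy 1s column
t = np.array([1.7, 1.3, 1.9])                        # toy targets
w = np.array([0.0, 2.0])                             # candidate weights (not the best fit)

residuals = X @ w - t
sse = 0.5 * (residuals @ residuals)                  # sum-of-squares error E(w)
print(sse)
```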

Slide 8: Squared Error on House Price Example
[Figure 18.13, Russell and Norvig.]
Notes: Left: for given values of w_0, w_1, a plot of the squared-difference loss for the data shown on the right (as before). Note that the loss function is convex, so there is a single minimum.

Slide 9: Intuition
Suppose that there is an exact solution and that the input matrix is invertible. Then we can find the solution by simple matrix inversion: w = X^{-1} t.

Alas, X is hardly ever square, let alone invertible. But X^T X is square, and usually invertible. So multiply both sides of the equation Xw = t by X^T, giving X^T X w = X^T t, then invert X^T X.

Notes: If X^T X is not invertible, we can perturb it with the identity matrix: X^T X + epsilon * I. Let's now prove that this is the least-squares solution.

Slide 10: Partial Derivative
Think about a single weight parameter w_j. The partial derivative of the error is
dE/dw_j = sum_n (w^T x_n - t_n) x_{nj}.

The gradient changes the weight to bring the prediction closer to the actual value.
Notes: Notice that the gradient is intuitive; check this in the spreadsheet. The minus sign gets reversed by the inner derivative.

Slide 11: Gradient

Find the gradient contribution for each input x_n and add them up:
grad E(w) = sum_n (w^T x_n - t_n) x_n = X^T (X w - t),
a linear combination of the row vectors x_n with coefficients (w^T x_n - t_n).
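A sketch of this gradient formula together with plain gradient descent on toy data; the step size eta and the iteration count are arbitrary illustrative choices:

```python
import numpy as np

X = np.array([[1.0, 0.8], [1.0, 0.6], [1.0, 0.9]])  # toy inputs, dummy 1s column
t = np.array([1.7, 1.3, 1.9])                        # t was generated as 0.1 + 2*x
w = np.zeros(2)                                      # initial weights
eta = 0.1                                            # step size (illustrative)

for _ in range(5000):
    grad = X.T @ (X @ w - t)   # sum over n of (w.x_n - t_n) x_n
    w -= eta * grad            # move the prediction closer to the actual values

print(w)                       # converges to roughly [0.1, 2.0] for this toy data
```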

Slide 12: Solution: The Pseudo-Inverse

Assume that X^T X is invertible. Then the least-squares solution is given by
w = (X^T X)^{-1} X^T t.
The matrix (X^T X)^{-1} X^T is called the pseudo-inverse of X.
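A sketch of the closed-form solution on toy data. In practice one would usually call np.linalg.lstsq rather than form X^T X explicitly; both routes are shown:

```python
import numpy as np

X = np.array([[1.0, 0.8], [1.0, 0.6], [1.0, 0.9], [1.0, 0.5]])  # toy design matrix
t = np.array([1.75, 1.28, 1.88, 1.12])                          # toy targets

# Normal equations: solve (X^T X) w = X^T t.
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent, numerically more stable route.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_normal, w_lstsq)
```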

Slide 13: The w_0 Offset
Recall the formula for the partial derivative, and that x_{n0} = 1 for all n. Write w* = (w_1, w_2, ..., w_D) for the weight vector without w_0, and similarly x_n* = (x_{n1}, x_{n2}, ..., x_{nD}) for the n-th feature vector without the dummy input. Then
dE/dw_0 = sum_n (w_0 + w* . x_n* - t_n).

Setting the partial derivative to 0, we get
w_0 = (1/N) sum_n t_n - w* . (1/N) sum_n x_n*,
i.e. the average target value equals the average predicted value.
Notes: Derivation: 1. Write the sum twice, once with just w_0, which gives N * w_0. 2. Move sum_n (y_n - t_n) over to the left, which makes t_n positive. 3. Divide both sides by N.
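A quick numerical check of this identity, assuming a least-squares fit with a dummy 1s column; all data below is synthetic and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_star = rng.normal(size=(50, 3))                # features without the dummy input
t = X_star @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

X = np.hstack([np.ones((50, 1)), X_star])        # prepend the dummy column of 1s
w, *_ = np.linalg.lstsq(X, t, rcond=None)
w0, w_star = w[0], w[1:]

# At the least-squares solution: w_0 = mean(t) - w* . mean(x*).
print(w0, t.mean() - X_star.mean(axis=0) @ w_star)   # the two numbers agree
```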

Slide 14: Geometric Interpretation
[Figure from Bishop: the target vector t and its projection onto the column space of X; phi_i denotes column vector i.]
Any vector of the form y = Xw is a linear combination of the columns (variables) of X. If y is the least-squares approximation, then y is the orthogonal projection of t onto this subspace.
Notes: The orthogonal projection is the closest vector to t in the subspace.

Slide 15: Probabilistic Interpretation

Slide 16: Noise Model
A linear function predicts a deterministic value y_n(x_n, w) for each input vector. We can turn this into a probabilistic prediction via the model
true value = predicted value + random noise, i.e. t_n = y_n(x_n, w) + epsilon_n.
Let's start with a Gaussian noise model.
Notes: Same noise model for all inputs.
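A sketch of generating data from this Gaussian noise model and recovering the weights by least squares; the "true" weights and noise level are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])      # dummy 1s column + one feature

w_true = np.array([0.5, 2.0])             # "true" weights (illustrative)
sigma = 0.2                               # noise standard deviation

t = X @ w_true + rng.normal(scale=sigma, size=N)   # true value = prediction + noise

# Least squares recovers roughly the true weights from the noisy targets.
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w_hat)
```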

Slide 17: Curve Fitting With Noise

Notes: Don't worry about the beta (noise precision) for now. The noise model gives a measure of uncertainty. Remember turning functions into probabilities; also used for recommendation systems.

Slide 18: The Gaussian Distribution

Slide 19: Meet the Exponential Family
A common way to define a probability density p(x) is as an exponential function of x.
Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1, e.g. (1/2)^n, (1/e)^x.
Deeper mathematical motivation: exponential pdfs have good statistical properties for learning, e.g. conjugate priors, maximum likelihood, sufficient statistics.
Notes: In fact, the exponential family is the only class with these properties. To see this for the conjugate prior, note that the likelihood is typically a product (i.i.d. data), so the posterior is proportional to prior x product. If the prior is also a product (exponential), then the posterior is a product like the prior. If the prior has some other form, then "other form x product" usually does not have that form.

Slide 20: Reading Exponential Probability Formulas
Suppose there is a relevant feature f(x) and I want to express that the greater f(x) is, the less probable x is (f(x) up, p(x) down).
Use p(x) = exp(-f(x)).

Slide 21: Example: Exponential Form of Sample Size
Fair coin: the longer the sample (size n), the less likely it is: p(n) = 2^{-n}.
[Plot: ln p(n) against sample size n; a straight line with negative slope.]
Notes: Try to do a Matlab plot. The slope goes down because of the minus sign.

Slide 22: Location Parameter
The further x is from the center mu, the less likely it is.
[Plot: ln p(x) against (x - mu)^2; a straight line with negative slope.]

Slide 23: Spread/Precision Parameter
The greater the spread sigma^2, the more likely x is (away from the mean).
The greater the precision lambda = 1/sigma^2, the less likely x is (away from the mean).
[Plot: ln p(x) against (x - mu)^2, with the slope controlled by the precision 1/sigma^2 = lambda.]

Slide 24: Minimal Energy = Maximum Probability
The greater the energy E(x) (of the joint state), the less probable the state x is.
[Plot: ln p(x) against E(x); a straight line with negative slope.]

Slide 25: Normalization
Let p*(x) be an unnormalized density function. To make it a probability density function, we need to find a normalization constant Z such that the integral of p*(x)/Z over x equals 1.

Therefore Z = integral of p*(x) dx.

For the Gaussian (Laplace 1782): Z = integral of exp(-(x - mu)^2 / (2 sigma^2)) dx = sqrt(2 pi sigma^2).

Notes: So all I have to do is solve the integral!

Slide 26: Central Limit Theorem
The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows (Laplace, 1810).
Example: the sum of N uniform [0,1] random variables.
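A small simulation sketch of the uniform example: as N grows, the standardized sum's tail probability approaches the Gaussian value of about 0.0228 for z > 2 (sample counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
for N in (1, 2, 10):
    sums = rng.uniform(0.0, 1.0, size=(100_000, N)).sum(axis=1)
    # Standardize and compare the tail mass with the Gaussian value ~0.0228 for z > 2.
    z = (sums - sums.mean()) / sums.std()
    print(N, np.mean(z > 2.0))
```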

Slide 27: Gaussian Likelihood Function
Exercise: assume a Gaussian noise model, so the likelihood function becomes (Bishop, Eq. 3.10)
p(t | X, w, beta) = product over n of N(t_n | w^T phi(x_n), beta^{-1}).

Show that the maximum likelihood solution minimizes the sum-of-squares error
E_D(w) = (1/2) sum_n (t_n - w^T phi(x_n))^2.
Notes: Proof: take the log of the Gaussian likelihood.
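A numerical sketch of the exercise: for a fixed noise precision beta, the negative log-likelihood differs from beta times the sum-of-squares error only by a constant, so both are minimized by the same w (synthetic design matrix and targets):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 2))               # design matrix of basis function values
t = Phi @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=30)
beta = 1.0 / 0.3**2                          # assumed noise precision

def neg_log_likelihood(w):
    resid = t - Phi @ w
    return 0.5 * beta * (resid @ resid) - 0.5 * len(t) * np.log(beta / (2 * np.pi))

def sum_squares(w):
    resid = t - Phi @ w
    return 0.5 * (resid @ resid)

for w in (np.array([0.0, 0.0]), np.array([1.0, -1.0])):
    print(neg_log_likelihood(w) - beta * sum_squares(w))   # same constant for every w
```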

Slide 28: Regression With Basis Functions

Slide 29: Nonlinear Features
We can increase the power of linear regression by using functions of the input features instead of the raw input features. These functions are called basis functions. Linear regression can then be used to assign weights to the basis functions.

Slide 30: Linear Basis Function Models (1)
Generally,
y(x, w) = sum_{j=0}^{M-1} w_j phi_j(x) = w^T phi(x),

where the phi_j(x) are known as basis functions. Typically phi_0(x) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: phi_d(x) = x_d.

Notes: We can often think of basis functions as features computed from the data vector x.

Slide 31: Linear Basis Function Models (2)
Polynomial basis functions:
phi_j(x) = x^j.

These are global (the same over the whole input space): a small change in x affects all basis functions.
Notes: The figure shows the functions for different choices of exponent.
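A sketch of building the polynomial design matrix Phi for a 1-D input; the degree and input grid are arbitrary choices:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)         # 1-D inputs
M = 4                                # number of basis functions
Phi = x[:, None] ** np.arange(M)     # columns: x^0, x^1, x^2, x^3
print(Phi)
```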

Slide 32: Linear Basis Function Models (3)
Gaussian basis functions:
phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).

These are local; a small change in x only affects nearby basis functions. mu_j and s control location and scale (width).

Related to kernel methods.
Notes: Intuition: use features computed from parts of an image or body.
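A sketch of Gaussian basis features for a 1-D input; the centers mu_j and width s are arbitrary choices:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)
mu = np.linspace(0.0, 1.0, 4)        # basis function centers
s = 0.2                              # shared width

Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))   # shape (N, M)
print(Phi.round(3))
```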

Slide 33: Linear Basis Function Models (4)
Sigmoidal basis functions:
phi_j(x) = sigma((x - mu_j) / s),

where sigma(a) = 1 / (1 + exp(-a)).

These too are local; a small change in x only affects nearby basis functions. mu_j and s control location and scale (slope).
Notes: Maps x to the 0-1 range, like a smooth threshold; also reads like a probability.
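The corresponding sketch for sigmoidal basis features, again with arbitrary centers and scale:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(0.0, 1.0, 5)
mu = np.linspace(0.0, 1.0, 4)
s = 0.1

Phi = sigmoid((x[:, None] - mu[None, :]) / s)    # shape (N, M)
print(Phi.round(3))
```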

Slide 34: Basis Function Example Transformation

Notes: Legend: left, data points in 2D (red and blue are class labels; ignore them). There are two Gaussian basis functions; their centers are shown as crosses and their contours as green circles. Right: each point is now mapped to another pair of numbers (roughly, distance to phi_1 and distance to phi_2). Intuitive example: think of the Gaussian centers as indicating parts of a picture, or parts of a body.

Slide 35: Limitations of Fixed Basis Functions
M basis functions along each dimension of a D-dimensional input space require M^D basis functions: the curse of dimensionality.
In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.

Slide 36: Overfitting and Regularization

Slide 37: Polynomial Curve Fitting

Slide 38: 0th Order Polynomial

Slide 39: 3rd Order Polynomial

Slide 40: 9th Order Polynomial

Slide 41: Over-fitting

Root-Mean-Square (RMS) error: E_RMS = sqrt(2 E(w*) / N).

Slide 42: Polynomial Coefficients

Notes: Large weights are a clue to overfitting.

Slide 43: Data Set Size

9th Order Polynomial
Notes: Overfitting depends on the data set size.

Slide 44: 1st Order Polynomial

Slide 45: Data Set Size

9th Order Polynomial

Slide 46: Quadratic Regularization
Penalize large coefficient values:
E~(w) = (1/2) sum_n (y(x_n, w) - t_n)^2 + (lambda/2) ||w||^2.

Slide 47: Regularization: [9th order polynomial fit for one value of ln lambda]

Slide 48: Regularization: [9th order polynomial fit for another value of ln lambda]

Slide 49: Regularization: E_RMS vs. ln lambda

Notes: Bias vs. variance analysis of the error: there are two components to the error.

Slide 50: Regularized Least Squares (1)
Consider the error function
E_D(w) + lambda E_W(w).

With the sum-of-squares error function and a quadratic regularizer, we get
E(w) = (1/2) sum_n (t_n - w^T phi(x_n))^2 + (lambda/2) w^T w,

which is minimized by
w = (lambda I + Phi^T Phi)^{-1} Phi^T t.

Here E_D(w) is the data term and E_W(w) is the regularization term;

lambda is called the regularization coefficient.
Notes: Problem in the assignment. This inverse always exists.
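A sketch of the regularized (ridge) solution; note the lambda * I term added to Phi^T Phi, which is why this inverse always exists for lambda > 0 (toy data, arbitrary lambda):

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(20, 5))                   # design matrix of basis features
t = Phi @ rng.normal(size=5) + rng.normal(scale=0.1, size=20)
lam = 0.1                                        # regularization coefficient lambda

w_ridge = np.linalg.solve(lam * np.eye(5) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```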

Slide 51: Regularized Least Squares (2)
With a more general regularizer, we have
E(w) = (1/2) sum_n (t_n - w^T phi(x_n))^2 + (lambda/2) sum_j |w_j|^q.
[Figure: contours of the regularizer for different q; q = 1 is the lasso, q = 2 is the quadratic regularizer.]

Slide 52: Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.
Notes: But it is not rotation invariant.

Slide 53: Evaluating Classifiers and Parameters: Cross-Validation

Slide 54: Evaluating Learners on a Validation Set
[Diagram: Training Data -> Learner -> Model -> evaluated on a Validation Set.]
Notes: Also called a testing set or hold-out set.

Slide 55: What If There Is No Validation Set?
What does training error tell me about the generalization performance on a hypothetical validation set?
Scenario 1: You run a big pharmaceutical company. Your new drug looks pretty good on the trials you've done so far. The government tells you to test it on another 10,000 patients.
Scenario 2: Your friendly machine learning instructor provides you with another validation set.
What if you can't get more validation data?

Slide 56: Examples from Andrew Moore
Andrew Moore's slides.
Notes: Also a demo of Naive Bayes vs. decision trees on the playtennis data set.

Slide 57: Cross-Validation for Evaluating Learners

Cross-validation is used to estimate the performance of a learner on future unseen data. Split the data into 4 folds; for each fold in turn, learn on the other 3 folds and test on the held-out fold.
Notes: Show Weka demo. This is 4-fold cross-validation; a common default is 10 folds. Use it to evaluate lambda, stopping at the first minimum of the error function. The jackknife leaves out one data point only, and does this for every data point. Think about doing this for the Bernoulli distribution.

Slide 58: Cross-Validation for Meta Methods
[Diagram: Training Data -> Learner(lambda) -> Model(lambda).]
If the learner requires setting a parameter, we can evaluate different parameter settings against the data using training error or cross-validation. Then cross-validation is part of learning.

Slide 59: Cross-Validation for Evaluating a Parameter Value

Cross-validation for lambda set to a specific value (e.g. lambda = 1): for each of the 4 folds, learn with that lambda on the other 3 folds and test on the held-out fold. The average error over all 4 runs is the cross-validation estimate of the error for that lambda value.
Notes: This is 4-fold cross-validation; a common default is 10 folds. Use it to evaluate lambda, stopping at the first minimum of the error function. The jackknife leaves out one data point only, and does this for every data point. Think about doing this for the Bernoulli distribution.
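A sketch of k-fold cross-validation for choosing lambda in regularized least squares; the fold count, lambda grid, and data are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(40, 6))
t = Phi @ rng.normal(size=6) + rng.normal(scale=0.5, size=40)

def ridge_fit(Phi, t, lam):
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

def cv_error(Phi, t, lam, k=4):
    folds = np.array_split(np.arange(len(t)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(t)), fold)       # the other k-1 folds
        w = ridge_fit(Phi[train], t[train], lam)
        errs.append(np.mean((Phi[fold] @ w - t[fold]) ** 2))  # test on held-out fold
    return np.mean(errs)       # average error over the k runs

for lam in (0.01, 0.1, 1.0, 10.0):
    print(lam, cv_error(Phi, t, lam))
```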

Slide 60: Stopping Criterion

Stop at the first minimum of the error function; this applies to any type of validation (hold-out set validation, cross-validation).

Slide 61: Bayesian Linear Regression

Slide 62: Bayesian Linear Regression (1)
Define a conjugate shrinkage prior over the weight vector w:
p(w | sigma^2) = N(w | 0, sigma^2 I).
Combining this with the likelihood function, and using the results for marginal and conditional Gaussian distributions, gives a posterior distribution over w.
The log of the posterior = sum of squared errors + quadratic regularization.
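A sketch of the posterior computation under a zero-mean Gaussian prior, written in the common alpha/beta parameterization (prior precision alpha = 1/sigma^2, noise precision beta); the numbers are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)
Phi = rng.normal(size=(25, 3))                   # design matrix
t = Phi @ np.array([0.5, -0.3, 1.0]) + rng.normal(scale=0.2, size=25)

alpha = 2.0                                      # prior precision (1 / prior variance)
beta = 25.0                                      # noise precision

# Posterior over w is Gaussian, with covariance S_N and mean m_N.
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# The posterior mean equals the regularized least-squares solution with lambda = alpha / beta.
print(m_N)
```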

Slide 63: Bayesian Linear Regression (3)
[Figure: 0 data points observed; prior over w, and sample functions in data space.]

Slide 64: Bayesian Linear Regression (4)

[Figure: 1 data point observed; likelihood, posterior over w, and data space panels.]

Slide 65: Bayesian Linear Regression (5)

[Figure: 2 data points observed; likelihood, posterior over w, and data space panels.]

Slide 66: Bayesian Linear Regression (6)

[Figure: 20 data points observed; likelihood, posterior over w, and data space panels.]

Slide 67: Predictive Distribution (1)
Predict t for new values of x by integrating over w:
p(t | x, t) = integral of p(t | x, w) p(w | t) dw.
This can be solved analytically.
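A sketch of the resulting analytic predictive mean and variance under the Gaussian model, using the posterior mean m_N and covariance S_N as in the previous sketch (same illustrative alpha, beta, and data):

```python
import numpy as np

rng = np.random.default_rng(7)
Phi = rng.normal(size=(25, 3))
t = Phi @ np.array([0.5, -0.3, 1.0]) + rng.normal(scale=0.2, size=25)
alpha, beta = 2.0, 25.0

S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi_new = np.array([0.1, -0.4, 0.7])             # basis features of a new input x
pred_mean = m_N @ phi_new
pred_var = 1.0 / beta + phi_new @ S_N @ phi_new  # noise variance + parameter uncertainty
print(pred_mean, pred_var)
```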

Slide 68: Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.

Notes: The green curve is sin(2 pi x) + noise. Left: the red curve is the mean (the single Bayesian prediction), shown with one standard deviation of the posterior-averaged prediction at each input x. Right: curves sampled from the posterior over weights.

Slide 69: Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.

Slide 70: Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.

Slide 71: Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.
