
CS 59000 Statistical Machine Learning, Lecture 7

Yuan (Alan) Qi, Purdue CS

Sept. 16, 2008

Acknowledgement: Sargur Srihari’s slides


Outline

Review of noninformative priors, nonparametric methods, and nonlinear basis functions

Regularized regression
Bayesian regression
Equivalent kernel
Model comparison


The Exponential Family (1)

The exponential family of distributions over x takes the form

p(x | η) = h(x) g(η) exp{ ηᵀ u(x) },

where η is the natural parameter and g(η) satisfies

g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1,

so g(η) can be interpreted as a normalization coefficient.


Property of Normalization Coefficient

From the definition of g(η), taking the gradient of both sides of the normalization condition with respect to η gives

∇g(η) ∫ h(x) exp{ ηᵀ u(x) } dx + g(η) ∫ h(x) exp{ ηᵀ u(x) } u(x) dx = 0.

Thus

−∇ ln g(η) = E[u(x)].


Conjugate priors

For any member of the exponential family, there exists a conjugate prior of the form

p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ ν ηᵀ χ }.

Combining with the likelihood function, we get the posterior

p(η | X, χ, ν) ∝ g(η)^(ν+N) exp{ ηᵀ ( Σₙ u(xₙ) + ν χ ) }.

The prior corresponds to ν pseudo-observations with value χ.


Noninformative Priors (1)

With little or no information available a priori, we might choose a noninformative prior:
• λ discrete, K-nomial: p(λ) = 1/K
• λ ∈ [a, b] real and bounded: p(λ) = 1/(b − a)
• λ real and unbounded: a constant prior is improper!

A constant prior may no longer be constant after a change of variable; consider p(λ) constant and λ = η². Then p_η(η) = p_λ(η²) |dλ/dη| = 2η p_λ(η²) ∝ η, which is not constant.
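A quick numerical check of this change-of-variable effect (not part of the original slides; the uniform range, sample size, and bin count are illustrative):

```python
import numpy as np

# lambda is uniform (constant density) on [0, 1] and eta = sqrt(lambda),
# so p(eta) = p(lambda) * |dlambda/deta| = 2 * eta, which is not constant.
rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)
eta = np.sqrt(lam)

# Empirical density of eta in a few bins vs. the predicted 2 * eta
hist, edges = np.histogram(eta, bins=5, range=(0.0, 1.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centres, hist):
    print(f"eta ~ {c:.1f}: empirical {h:.2f}, predicted {2 * c:.2f}")
```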


Noninformative Priors (2)

Translation-invariant priors. Consider a density of the form

p(x | μ) = f(x − μ).

For a corresponding prior over μ, translation invariance requires

∫_A^B p(μ) dμ = ∫_{A−c}^{B−c} p(μ) dμ = ∫_A^B p(μ − c) dμ

for any A and B. Thus p(μ) = p(μ − c), and p(μ) must be constant.


Noninformative Priors (4)

Scale-invariant priors. Consider a density of the form

p(x | σ) = (1/σ) f(x/σ)

and make the change of variable x̃ = x/c, σ̃ = σ/c. For a corresponding prior over σ, scale invariance requires

∫_A^B p(σ) dσ = ∫_{A/c}^{B/c} p(σ) dσ = ∫_A^B (1/c) p(σ/c) dσ

for any A and B. Thus p(σ) ∝ 1/σ, and so this prior is improper too. Note that this corresponds to p(ln σ) being constant.


Nonparametric Methods (1)

Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.

Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.


Nonparametric Methods (2)

Histogram methods partition the data space into distinct bins of width Δᵢ and count the number of observations, nᵢ, falling in each bin; the density estimate is then pᵢ = nᵢ / (N Δᵢ).

• Often the same width is used for all bins, Δᵢ = Δ.
• Δ acts as a smoothing parameter.
• In a D-dimensional space, using M bins in each dimension requires M^D bins!
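A minimal sketch of the histogram estimate pᵢ = nᵢ / (N Δ) described above (not from the original slides; the data and bin width are illustrative):

```python
import numpy as np

# Histogram density estimate on illustrative 1-D data; the common bin width
# Delta acts as the smoothing parameter.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.0, 1.0, 500)])

delta = 0.5                                    # common bin width Delta
edges = np.arange(data.min(), data.max() + delta, delta)
counts, _ = np.histogram(data, bins=edges)     # n_i per bin
density = counts / (len(data) * delta)         # p_i = n_i / (N * Delta)
print(density[:5])
```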


Nonparametric Methods (3)

Assume observations are drawn from a density p(x) and consider a small region R containing x, with probability mass

P = ∫_R p(x) dx.

The probability that K out of N observations lie inside R is Bin(K | N, P), and if N is large

K ≃ N P.

If the volume of R, V, is sufficiently small, p(x) is approximately constant over R and

P ≃ p(x) V.

Thus

p(x) ≃ K / (N V).


Nonparametric Methods (5)

To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:

p(x) = (1/N) Σₙ 1/(2πh²)^(D/2) exp{ −‖x − xₙ‖² / (2h²) }.

Any kernel k(u) such that

k(u) ≥ 0 and ∫ k(u) du = 1

will work. The bandwidth h acts as a smoothing parameter.
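A minimal NumPy sketch of the Gaussian kernel density estimator above (not from the original slides; the data and the bandwidth h are illustrative):

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """Gaussian kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2 I)."""
    data = np.atleast_2d(data)           # shape (N, D)
    x_query = np.atleast_2d(x_query)     # shape (Q, D)
    N, D = data.shape
    # Squared distances between every query point and every data point
    sq_dists = ((x_query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.exp(-sq_dists / (2.0 * h ** 2)).sum(axis=1) / (N * norm)

# Illustrative usage: bimodal 1-D data, bandwidth chosen by hand
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])[:, None]
print(gaussian_kde(np.array([[0.0], [2.0]]), data, h=0.3))
```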


K-Nearest-Neighbours for Classification (1)

Given a data set with Nₖ data points from class Cₖ, so that Σₖ Nₖ = N, grow a sphere around x until it contains exactly K points, of which Kₖ belong to class Cₖ. Then

p(x | Cₖ) = Kₖ / (Nₖ V)

and correspondingly

p(x) = K / (N V).

Since p(Cₖ) = Nₖ / N, Bayes’ theorem gives

p(Cₖ | x) = p(x | Cₖ) p(Cₖ) / p(x) = Kₖ / K.
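A minimal sketch of the resulting classification rule, assign x to the class maximizing Kₖ / K (not from the original slides; the toy data is illustrative):

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, K):
    """Classify x_query by the majority class among its K nearest training points,
    i.e. pick the class C_k maximizing K_k / K."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative usage with a tiny two-class data set
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.9, 1.0]), X, y, K=3))  # -> 1
```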


K-Nearest-Neighbours for Classification (2)

[Figure: K-nearest-neighbour classification of a two-class data set, shown for K = 1 and K = 3.]


Basis Functions


Examples of Basis Functions (1)


Maximum Likelihood Estimation (1)


Maximum Likelihood Estimation (2)


Sequential Estimation


Regularized Least Squares
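The equations on this slide were images and are not reproduced in the transcript. As a reference, a minimal sketch of the standard quadratic-regularized (ridge) least-squares solution w = (λI + ΦᵀΦ)⁻¹ Φᵀ t; the polynomial basis and the data are illustrative:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Design matrix Phi with basis functions phi_j(x) = x**j, j = 0..degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Illustrative usage on noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 20)
w = ridge_fit(polynomial_design_matrix(x, 9), t, lam=1e-3)
print(w)
```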


More Regularizers


Visualization of Regularized Regression


Bayesian Linear Regression


Posterior Distributions of Parameters


Predictive Posterior Distribution
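The posterior and predictive formulas for these slides were images that did not survive the transcript. The sketch below assumes the standard Gaussian setup: prior p(w) = N(w | 0, α⁻¹I) and noise precision β, giving S_N⁻¹ = αI + βΦᵀΦ, m_N = βS_NΦᵀt, predictive mean m_Nᵀφ(x) and variance 1/β + φ(x)ᵀS_Nφ(x). The basis functions, data, and hyperparameters are illustrative:

```python
import numpy as np

def bayes_linear_posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | 0, alpha^{-1} I), noise precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    mean = phi_x @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_x, S_N, phi_x)
    return mean, var

def gaussian_basis(x, centres, s=0.2):
    """Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s ** 2))

# Illustrative usage
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 25)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 25)
centres = np.linspace(0.0, 1.0, 9)
m_N, S_N = bayes_linear_posterior(gaussian_basis(x, centres), t, alpha=2.0, beta=25.0)
x_new = np.array([0.5, 3.0])            # one query near the data, one far away
mean, var = predictive(gaussian_basis(x_new, centres), m_N, S_N, beta=25.0)
print(mean, var)
```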


Examples of Predictive Distribution


Question

Suppose we use Gaussian basis functions.

What will happen to the predictive distribution if we evaluate it at places far from all training data points?


Equivalent Kernel

Given the posterior mean m_N = β S_N Φᵀ t, the predictive mean can be written

y(x, m_N) = m_Nᵀ φ(x) = Σₙ β φ(x)ᵀ S_N φ(xₙ) tₙ = Σₙ k(x, xₙ) tₙ,

where the equivalent kernel is

k(x, x′) = β φ(x)ᵀ S_N φ(x′).
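A small self-contained numerical check of this identity (not from the original slides; the polynomial basis, data, and hyperparameters α, β are illustrative):

```python
import numpy as np

# With m_N = beta * S_N * Phi^T * t, the predictive mean m_N^T phi(x) equals
# sum_n k(x, x_n) t_n, where k(x, x') = beta * phi(x)^T S_N phi(x').
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 20)
Phi = np.vander(x, 6, increasing=True)            # simple polynomial basis
alpha, beta = 2.0, 25.0
S_N = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.array([0.3, 0.7])
Phi_new = np.vander(x_new, 6, increasing=True)
K = beta * Phi_new @ S_N @ Phi.T                  # equivalent kernel k(x, x_n)
print(np.allclose(Phi_new @ m_N, K @ t))          # True: same predictive mean
```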


Equivalent kernel

[Figure: basis functions and their equivalent kernels, shown for Gaussian, polynomial, and sigmoidal basis functions.]


Covariance between two predictions

The covariance between predictions at two points is

cov[y(x), y(x′)] = φ(x)ᵀ S_N φ(x′) = β⁻¹ k(x, x′),

so the predictive means at nearby points are highly correlated, whereas for more distant pairs of points the correlation is smaller.


Bayesian Model Comparison

Suppose we want to compare models Mᵢ, i = 1, …, L. Given a training set D, we compute the posterior

p(Mᵢ | D) ∝ p(Mᵢ) p(D | Mᵢ).

Model evidence (also known as marginal likelihood): p(D | Mᵢ).

Bayes factor: p(D | Mᵢ) / p(D | Mⱼ).


Likelihood, Parameter Posterior & Evidence

Likelihood and evidence:

p(D | Mᵢ) = ∫ p(D | w, Mᵢ) p(w | Mᵢ) dw.

Parameter posterior distribution and evidence:

p(w | D, Mᵢ) = p(D | w, Mᵢ) p(w | Mᵢ) / p(D | Mᵢ),

so the evidence is the normalizing constant of the parameter posterior.


Crude Evidence Approximation

Assume the posterior distribution is sharply peaked around its mode w_MAP, with width Δw_posterior, and that the prior is flat with width Δw_prior. Then

p(D) = ∫ p(D | w) p(w) dw ≃ p(D | w_MAP) Δw_posterior / Δw_prior.


Evidence penalizes over-complex models

Given M parameters, each with the same ratio Δw_posterior / Δw_prior, this gives

ln p(D) ≃ ln p(D | w_MAP) + M ln( Δw_posterior / Δw_prior ).

The first term rewards fit to the data, while the second (negative) term penalizes complexity and grows in magnitude with M. Maximizing the evidence therefore leads to a natural trade-off between data fitting and model complexity.
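As an illustration of this trade-off (not part of the original slides), the sketch below evaluates the standard Gaussian-linear-model log evidence ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln 2π for polynomial bases of increasing order; the data and the fixed hyperparameters α, β are illustrative:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) for the Gaussian linear model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2.0 * np.pi))

# Evidence vs. polynomial order: model complexity is penalized automatically
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 30)
for degree in range(10):
    Phi = np.vander(x, degree + 1, increasing=True)
    print(degree, round(log_evidence(Phi, t, alpha=5e-3, beta=25.0), 2))
```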


Evidence Approximation & Empirical Bayes

Approximate the fully Bayesian treatment by maximizing the marginal likelihood: fix the hyperparameters at the values α̂, β̂ that maximize the evidence p(t | α, β), and use these values in the predictive distribution.

This is known as empirical Bayes or type-2 maximum likelihood.


Model Evidence and Cross-Validation

[Figure: root-mean-square error and model evidence when fitting polynomial regression models.]


Next class

Linear Classification