Point Estimation / Linear Regression. Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, January 12th, 2005



Page 1

Point Estimation
Linear Regression

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

January 12th, 2005

Page 2

Announcements

Recitations – new day and room: Doherty Hall 1212, Thursdays 5–6:30pm, starting January 20th

Use the mailing list: [email protected]

Page 3

Your first consulting job

A billionaire from the suburbs of Seattle asks you a question:

He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
You say: Please flip it a few times:

You say: The probability is:

He says: Why???
You say: Because…

Page 4

Thumbtack – Binomial Distribution

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:
- Independent events
- Identically distributed according to the Binomial distribution

Sequence D of αH Heads and αT Tails

Page 5

Maximum Likelihood Estimation

Data: observed set D of αH Heads and αT Tails
Hypothesis: Binomial distribution
Learning θ is an optimization problem

What’s the objective function?

MLE: Choose θ that maximizes the probability of observed data:
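The objective, shown as an equation image on the original slide, is the standard MLE objective for this model; reconstructed here:

```latex
\hat{\theta}_{MLE}
  = \arg\max_\theta P(D \mid \theta)
  = \arg\max_\theta \; \theta^{\alpha_H} (1-\theta)^{\alpha_T}
```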

Page 6

Your first learning algorithm

Set derivative to zero:
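Carrying out the step the slide names, taking the log-likelihood and setting its derivative to zero, gives the familiar closed form:

```latex
\ln P(D \mid \theta) = \alpha_H \ln\theta + \alpha_T \ln(1-\theta)
\qquad
\frac{d}{d\theta}\ln P(D \mid \theta)
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\;\Rightarrow\;
\hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```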

Page 7

How many flips do I need?

Billionaire says: I flipped 3 heads and 2 tails.
You say: θ = 3/5, I can prove it!
He says: What if I flipped 30 heads and 20 tails?
You say: Same answer, I can prove it!

He says: Which is better?
You say: Hmm… The more the merrier???
He says: Is this why I am paying you the big bucks???

Page 8

Simple bound (based on Hoeffding’s inequality)

For N = αH + αT, and the MLE estimate θ̂ = αH/N:

Let θ* be the true parameter; then for any ε > 0:
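The bound itself, an equation image on the original slide, is the standard Hoeffding inequality for this setting:

```latex
P\!\left(\,|\hat{\theta} - \theta^*| \ge \epsilon\,\right) \le 2\, e^{-2N\epsilon^2}
```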

Page 9

PAC Learning

PAC: Probably Approximately Correct
Billionaire says: I want to know the thumbtack parameter θ, within ε = 0.1, with probability at least 1 - δ = 0.95. How many flips?
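Solving the Hoeffding bound 2e^(-2Nε²) ≤ δ for N gives N ≥ ln(2/δ)/(2ε²). A minimal sketch (the function name is my own):

```python
import math

def flips_needed(eps, delta):
    # Hoeffding: P(|theta_hat - theta*| >= eps) <= 2*exp(-2*N*eps^2).
    # Requiring the right-hand side <= delta and solving for N gives
    #   N >= ln(2/delta) / (2*eps^2)
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# epsilon = 0.1, probability at least 1 - delta = 0.95:
print(flips_needed(0.1, 0.05))  # 185
```

So roughly 185 flips answer the billionaire's question at this accuracy and confidence.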

Page 10

What about a prior?

Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do?
You say: I can learn it the Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ

Page 11

Bayesian Learning

Use Bayes rule:

Or equivalently:
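The two forms, equation images on the original slide, are:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\qquad\text{or equivalently}\qquad
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)
```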

Page 12

Bayesian Learning for Thumbtack

Likelihood function is simply Binomial:

What about the prior?
- Represents expert knowledge
- Simple posterior form

Conjugate priors:
- Closed-form representation of the posterior
- For the Binomial, the conjugate prior is the Beta distribution

Page 13

Beta prior distribution – P(θ)

Likelihood function:
Posterior:

Page 14

Posterior distribution

Prior:
Data: αH heads and αT tails

Posterior distribution:
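The posterior, an equation image on the original slide, follows from multiplying the Beta prior by the Binomial likelihood; reconstructed:

```latex
P(\theta) = \mathrm{Beta}(\beta_H, \beta_T)
  \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}
\qquad
P(\theta \mid D)
  \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1}
  = \mathrm{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)
```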

Page 15

Using Bayesian posterior

Posterior distribution:

Bayesian inference: no longer a single parameter:

Integral is often hard to compute
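For the Beta posterior this particular integral does have a closed form, the posterior mean, which is worth writing out; in general such integrals are hard:

```latex
P(\text{heads} \mid D)
  = \int_0^1 \theta \, P(\theta \mid D)\, d\theta
  = \frac{\alpha_H + \beta_H}{\alpha_H + \beta_H + \alpha_T + \beta_T}
```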

Page 16

MAP: Maximum a posteriori approximation

As more data is observed, the Beta posterior becomes more peaked (more certain)

MAP: use most likely parameter:
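The MAP rule, an equation image on the slide, is:

```latex
\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D)
```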

Page 17

MAP for Beta distribution

MAP: use most likely parameter:

Beta prior is equivalent to extra thumbtack flips
As N → ∞, the prior is “forgotten”
But for small sample sizes, the prior is important!
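The “extra flips” interpretation can be illustrated with a small sketch (function names are my own); the MAP estimate is the mode of the Beta posterior, assuming both posterior parameters exceed 1:

```python
def mle(aH, aT):
    # MLE: observed fraction of heads
    return aH / (aH + aT)

def map_beta(aH, aT, bH, bT):
    # Mode of the Beta(bH + aH, bT + aT) posterior (assumes both
    # parameters > 1): the prior acts like bH - 1 extra heads and
    # bT - 1 extra tails.
    return (aH + bH - 1) / (aH + aT + bH + bT - 2)

# 3 heads, 2 tails, with a Beta(50, 50) prior ("close to 50-50"):
print(mle(3, 2))                      # 0.6
print(map_beta(3, 2, 50, 50))         # ~0.505: the prior dominates
# With 3000 heads and 2000 tails the prior is largely forgotten:
print(map_beta(3000, 2000, 50, 50))   # ~0.598
```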

Page 18

What about continuous variables?

Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians…

Page 19

MLE for Gaussian

Prob. of i.i.d. samples x1,…,xN:

Log-likelihood of data:
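The log-likelihood, an equation image on the slide, is, for i.i.d. Gaussian samples:

```latex
\ln P(x_1,\dots,x_N \mid \mu, \sigma)
  = -\frac{N}{2}\ln\!\left(2\pi\sigma^2\right)
    - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}
```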

Page 20

Your second learning algorithm: MLE for the mean of a Gaussian
What's the MLE for the mean?
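Setting the derivative of the log-likelihood with respect to μ to zero gives the sample mean:

```latex
\hat{\mu}_{MLE} = \frac{1}{N} \sum_{i=1}^N x_i
```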

Page 21

MLE for variance

Again, set derivative to zero:
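Setting the derivative with respect to σ² to zero yields the (biased) MLE:

```latex
\hat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^N \left(x_i - \hat{\mu}\right)^2
```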

Page 22

Learning Gaussian parameters

MLE:
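A minimal sketch of the two MLE formulas, the sample mean and the biased (1/N) variance; the function name is my own:

```python
def gaussian_mle(xs):
    # MLE for a Gaussian: sample mean, and the *biased* variance
    # (divide by N, not N - 1)
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

print(gaussian_mle([1, 2, 3, 4]))  # (2.5, 1.25)
```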

Bayesian learning is also possible, with conjugate priors:

Mean: Gaussian prior
Variance: Wishart distribution

Page 23

Prediction of continuous variables

Billionaire says: Wait, that's not what I meant!
You say: Chill out, dude.
He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
You say: I can regress that…

Page 24

The regression problem
Instances: <xj, tj>
Learn: mapping from x to t(x)
Hypothesis space:

Given basis functions, find coefficients w = {w1,…,wk}

Precisely, minimize the residual error:

Solve with simple matrix operations:
- Set derivative to zero
- Go to recitation Thursday 1/20
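As a sketch of "set derivative to zero" in the simplest case, a line t ≈ w0 + w1·x (basis functions 1 and x), the normal equations reduce to the textbook closed form (the function name is my own):

```python
def least_squares_line(xs, ts):
    # Fit t ~ w0 + w1*x (basis functions 1 and x) by setting the
    # derivative of the summed squared residuals to zero.
    n = len(xs)
    mx = sum(xs) / n
    mt = sum(ts) / n
    w1 = (sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
          / sum((x - mx) ** 2 for x in xs))
    w0 = mt - w1 * mx
    return w0, w1

# Points lying exactly on t = 2x + 1 are recovered exactly:
print(least_squares_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```

With a full basis-function matrix H the same idea gives w = (HᵀH)⁻¹Hᵀt, the matrix form the slide alludes to.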

Page 25

Billionaire (again) says: Why sum squared error???
You say: Gaussians, Dr. Gateson, Gaussians…

Model:

Learn w using MLE

But, why?

Page 26

Maximizing log-likelihood

Maximize:
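With the Gaussian noise model above, the log-likelihood is, up to constants, the negative sum of squared errors, which is why MLE for w is exactly least squares:

```latex
\ln P(D \mid \mathbf{w}, \sigma)
  = \text{const} - \frac{1}{2\sigma^2}
    \sum_j \Big(t_j - \sum_i w_i h_i(x_j)\Big)^2
```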

Page 27

Bias-Variance Tradeoff

Choice of hypothesis class introduces learning bias

More complex class → less bias
More complex class → more variance
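The tradeoff is usually summarized by the standard decomposition of expected squared error (implied, though not written out, on the slide):

```latex
\mathbb{E}\big[(t - \hat{t}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{t}(x)] - \bar{t}(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{t}(x) - \mathbb{E}[\hat{t}(x)])^2\big]}_{\text{variance}}
  + \text{noise}
```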

Page 28

What you need to know

Go to recitation for regression
And the other recitations too

Point estimation:
- MLE
- Bayesian learning
- MAP

Gaussian estimation

Regression:
- Basis functions = features
- Optimizing sum squared error
- Relationship between regression and Gaussians

Bias-Variance trade-off