Supervised Learning: Maximum Likelihood Estimation


Page 1

Supervised Learning: Maximum Likelihood Estimation

Page 2

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Page 3

Why MLE?

¤  Maximum Likelihood Estimation is a fundamental part of data analysis

¤  “MLE for Gaussians” is training wheels for our future techniques

¤  Learning Gaussians is more useful than you might guess

Page 4

Why Gaussian (Normal)?

¤  Gaussian distributions:
   ¤  are a family of distributions that share the same general shape
   ¤  are symmetric, with values more concentrated in the middle than in the tails

¤  Why are they important?
   ¤  many psychological and educational variables are distributed approximately normally
   ¤  they are easy for mathematical statisticians to work with

¤  Formally: the normal distribution N(μ, σ²) approximates sums of independent, identically distributed random variables (the Central Limit Theorem)

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
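As a quick sanity check (a minimal sketch of ours, not part of the original slides; the helper name gaussian_pdf is ours), this density can be evaluated in Python and compared against scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-3.0, 3.0, 7)
# scipy's norm is parameterized by the standard deviation, not the variance
assert np.allclose(gaussian_pdf(x, 0.0, 1.0), norm.pdf(x, loc=0.0, scale=1.0))
```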

Page 5

Learning Gaussians from Data

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,σ2)

¤  But you don’t know μ (you do know σ2)

¤  MLE: For which μ is x1, x2, ... xR most likely?

$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{by i.i.d.} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{monotonicity of log} \\
&= \arg\max_{\mu}\; \sum_{i=1}^{R} \frac{-(x_i - \mu)^2}{2\sigma^2} && \text{plug in the formula of the Gaussian} \\
&= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{after simplification}
\end{aligned}
$$
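A minimal numerical check (a sketch of ours, not from the slides) that minimizing the squared-error objective in the last line recovers the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)  # sigma^2 = 2.25 is "known"

# Minimize the sum of squared deviations over mu (last line of the derivation)
result = minimize_scalar(lambda mu: np.sum((x - mu) ** 2))
print(result.x)   # numerical minimizer
print(x.mean())   # sample mean; the two agree up to solver tolerance
```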

Page 6

MLE Strategy

Task: Find MLE θ assuming known form for p(Data| θ,stuff)

1.  Write LL = log P(Data| θ, stuff)

2.  Work out ∂LL/∂θ using high-school calculus

3.  Set ∂LL/∂θ=0 for a maximum, creating an equation in terms of θ

4.  Solve it

5.  Check that you have found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
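As a worked example of this recipe (a sketch of ours, not from the slides), sympy can carry out steps 1-5 for the Gaussian mean with known σ²:

```python
import sympy as sp

mu = sp.Symbol('mu')
sigma2 = sp.Symbol('sigma2', positive=True)
x = sp.symbols('x1:4')  # a tiny symbolic sample x1, x2, x3

# Step 1: LL = log P(Data | mu), with sigma2 treated as known
LL = sum(-sp.log(sp.sqrt(2 * sp.pi * sigma2)) - (xi - mu) ** 2 / (2 * sigma2)
         for xi in x)

# Steps 2-4: differentiate, set to zero, solve for mu
dLL = sp.diff(LL, mu)
mu_mle = sp.solve(sp.Eq(dLL, 0), mu)[0]
print(mu_mle)  # (x1 + x2 + x3)/3 -- the sample mean

# Step 5: the second derivative is negative, so this is a maximum
print(sp.diff(LL, mu, 2))  # -3/sigma2 < 0
```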

Page 7

The MLE μ

$$
\begin{aligned}
\mu_{mle} &= \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i - \mu)^2 \\
&= \text{the } \mu \text{ such that } 0 = \frac{\partial LL}{\partial\mu} = \frac{\partial}{\partial\mu}\sum_{i=1}^{R}(x_i - \mu)^2 = -\sum_{i=1}^{R} 2(x_i - \mu)
\end{aligned}
$$

Solving for μ:

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

Page 8

Interesting…

¤  The best estimate of the mean of a distribution is the mean of the sample!!

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

Page 9

A General MLE Strategy

Suppose θ = (θ1, θ2, ..., θn)^T is a vector of parameters

Task: Find MLE θ assuming known form for p(Data| θ,stuff)

1.  Write LL = log P(Data| θ,stuff)

2.  Work out ∂LL/∂θ using high-school calculus

3.  Solve the set of simultaneous equations ∂LL/∂θi = 0

4.  Check that you are at the maximum

$$
\frac{\partial LL}{\partial \theta} =
\begin{bmatrix}
\dfrac{\partial LL}{\partial\theta_1} \\
\dfrac{\partial LL}{\partial\theta_2} \\
\vdots \\
\dfrac{\partial LL}{\partial\theta_n}
\end{bmatrix}
$$

Page 10

MLE for Univariate Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,σ2)

¤  But you don’t know μ and σ2

¤  MLE: For which θ =(μ,σ2) is x1, x2,...xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -\frac{R}{2}\log(2\pi) - \frac{R}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$\frac{\partial LL}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \qquad\qquad \frac{\partial LL}{\partial\sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$

Page 11

MLE for Univariate Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,σ2)

¤  But you don’t know μ and σ2

¤  MLE: For which θ =(μ,σ2) is x1, x2,...xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -\frac{R}{2}\log(2\pi) - \frac{R}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

Setting both partial derivatives to zero:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \qquad\qquad 0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$
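These two simultaneous equations can be solved symbolically; a small sympy sketch (ours, using a symbolic sample of R = 4 points) previews the closed forms on the next slide:

```python
import sympy as sp

mu = sp.Symbol('mu')
sigma2 = sp.Symbol('sigma2', positive=True)
x = sp.symbols('x1:5')  # symbolic sample x1..x4, so R = 4

LL = sum(-sp.Rational(1, 2) * sp.log(2 * sp.pi * sigma2)
         - (xi - mu) ** 2 / (2 * sigma2) for xi in x)

# Solve dLL/dmu = 0 for mu, substitute, then solve dLL/dsigma2 = 0
mu_hat = sp.solve(sp.diff(LL, mu), mu)[0]
sigma2_hat = sp.solve(sp.diff(LL, sigma2).subs(mu, mu_hat), sigma2)[0]
print(mu_hat)                   # (x1 + x2 + x3 + x4)/4
print(sp.simplify(sigma2_hat))  # mean squared deviation from mu_hat
```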

Page 12

MLE for Univariate Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,σ2)

¤  But you don’t know μ or σ2

¤  MLE: For which θ =(μ,σ2) is x1, x2,...xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{i=1}^{R} x_i \qquad\qquad \sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2$$
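In code these closed forms are one-liners; a minimal numpy sketch (ours, on simulated data):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # true mu = 5, sigma^2 = 4

mu_mle = x.mean()                        # (1/R) * sum(x_i)
sigma2_mle = np.mean((x - mu_mle) ** 2)  # (1/R) * sum((x_i - mu_mle)^2)
print(mu_mle, sigma2_mle)
# np.var uses the same R denominator by default (ddof=0), so this matches:
assert np.isclose(sigma2_mle, np.var(x))
```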

Page 13

Exercise (Applying MLE to another model)

James Bond Tasting Martini

¤  We gave Mr. Bond a series of 16 taste tests

¤  At each test, we flipped a fair coin to determine whether to stir or shake the martini

¤  Mr. Bond was correct on 13/16 tests

¤  How can we model this experiment?

¤  Use MLE to estimate the parameters of your model.

Page 14

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Page 15

Unbiased Estimators

¤  An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter

¤  If x1, x2, ... xR ~(i.i.d) N(μ,σ²) then

$$E\left[\mu_{mle}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

μ_mle is unbiased

Page 16

Biased Estimators

¤  An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter

¤  If x1, x2, ... xR ~(i.i.d) N(μ,σ²) then

$$E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] \neq \sigma^2$$

σ²_mle is biased

Page 17

MLE Variance Bias

¤  If x1, x2, ... xR ~(i.i.d) N(μ,σ²) then

$$E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

Intuition check: consider the case of R = 1. Why should we expect that σ²_mle would be an underestimate of the true σ²?
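A small Monte Carlo sketch (ours, not from the slides) makes the (1 − 1/R) shrinkage visible:

```python
import numpy as np

rng = np.random.default_rng(7)
R, sigma2, trials = 5, 4.0, 200_000

# Draw many size-R samples and average the MLE variance estimates
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, R))
sigma2_mle = np.var(samples, axis=1)  # ddof=0: divide by R
print(sigma2_mle.mean())              # approximately (1 - 1/R) * sigma2 = 3.2
print((1 - 1 / R) * sigma2)           # 3.2
```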

Page 18

MLE Variance Bias

¤  If x1, x2, ... xR ~(i.i.d) N(μ,σ²) then

$$E\left[\sigma^2_{mle}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

so if we define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{R}}$$

then

$$E\left[\sigma^2_{unbiased}\right] = \sigma^2$$

Page 19

Unbiaseditude discussion

¤  Which is best?

¤  Answer:
   ¤  It depends on the task
   ¤  It does not make much difference once R becomes large

$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{mle})^2 \qquad\qquad \sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu_{mle})^2$$
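In NumPy the two estimators differ only in the ddof argument of np.var (a usage note of ours, not from the slides):

```python
import numpy as np

x = np.array([2.1, 3.4, 1.9, 4.2, 2.8])
print(np.var(x, ddof=0))  # sigma^2_mle: divides by R
print(np.var(x, ddof=1))  # sigma^2_unbiased: divides by R - 1
```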

Page 20

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Page 21

MLE for m-dimensional Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,Σ)

¤  But you don’t know μ or Σ

¤  MLE: For which θ =(μ,Σ) is x1, x2, ... xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} X_k \qquad\qquad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(X_k - \mu_{mle})(X_k - \mu_{mle})^T$$

Page 22

MLE for m-dimensional Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,Σ)

¤  But you don’t know μ or Σ

¤  MLE: For which θ =(μ,Σ) is x1, x2, ... xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} X_k \qquad\qquad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(X_k - \mu_{mle})(X_k - \mu_{mle})^T$$

$$\mu_i^{mle} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}$$

where 1 ≤ i ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and μ_i^mle is the ith component of μ_mle.

Page 23

MLE for m-dimensional Gaussian

¤  Suppose you have x1, x2, ... xR ~(i.i.d) N(μ,Σ)

¤  But you don’t know μ or Σ

¤  MLE: For which θ =(μ,Σ) is x1, x2, ... xR most likely?

$$\mu_{mle} = \frac{1}{R}\sum_{k=1}^{R} X_k \qquad\qquad \Sigma_{mle} = \frac{1}{R}\sum_{k=1}^{R}(X_k - \mu_{mle})(X_k - \mu_{mle})^T$$

$$\sigma_{ij}^{mle} = \frac{1}{R}\sum_{k=1}^{R}(x_{ki} - \mu_i^{mle})(x_{kj} - \mu_j^{mle})$$

where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and σ_ij^mle is the (i,j)th component of Σ_mle.
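A minimal numpy sketch (ours, on simulated data) computing both estimates; np.cov with bias=True matches the MLE's R denominator:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=5000)  # R x m data matrix

mu_mle = X.mean(axis=0)
centered = X - mu_mle
cov_mle = centered.T @ centered / len(X)  # divide by R: the MLE
# np.cov(..., bias=True) uses the same R denominator
assert np.allclose(cov_mle, np.cov(X, rowvar=False, bias=True))
print(mu_mle)
print(cov_mle)
```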

Page 24

Road Map

1.  Learning Univariate Gaussians

2.  What is a biased estimator?

3.  Learning Multivariate Gaussians

4.  Hypothesis Testing

Page 25

Confidence Intervals

¤  Indicate the reliability of an estimate

¤  How can we estimate this interval using hypothesis testing?

Page 26

Hypothesis Testing

¤  We gave Mr. Bond a series of 16 taste tests

¤  At each test, we flipped a fair coin to determine whether to stir or shake the martini

¤  Mr. Bond was correct on 13/16 tests:
   ¤  Is Mr. Bond able to distinguish between a stirred and a shaken martini?
   ¤  Was he just lucky?

Using a binomial distribution (N=16, k=13, p=0.5):

¤  P(someone who is lucky would be correct on 13/16 or more) = 0.0106

¤  The hypothesis that Mr. Bond was lucky is not proven false (but considerable doubt is cast on it)

¤  There is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred

From http://onlinestatbook.com/rvls.html
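The 0.0106 tail probability can be reproduced with scipy (a small sketch of ours):

```python
from scipy.stats import binom

# P(X >= 13) for X ~ Binomial(n=16, p=0.5); sf(k) returns P(X > k)
p_value = binom.sf(12, n=16, p=0.5)
print(round(p_value, 4))  # 0.0106
```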

Page 27

Probability Value (p-value)

¤  In the James Bond example, the computed probability of 0.0106 is the probability that he would be correct on 13 or more taste tests (out of 16) if he were just guessing

¤  Important:
   ¤  This is not the probability that he cannot tell the difference
   ¤  The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing)

¤  Using statistical terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome.

Page 28

Null Hypothesis & Statistical Significance

¤  In the previous example, the hypothesis that an effect is due to chance is called the null hypothesis

¤  The null hypothesis is typically the opposite of the researcher's hypothesis

¤  The null hypothesis is rejected when the probability value is lower than a specified threshold (e.g., 0.05 or 0.01) called the test level

¤  When the null hypothesis is rejected, the effect is said to be statistically significant

Page 29

Statistical Hypothesis Testing

¤  A hypothesis test determines a probability 1 − α (test level α, also called the significance level) that a sample X1,…,Xn from some unknown probability distribution has a certain property

¤  Examples: the sample comes from a normal distribution, or it has mean m

¤  General form:
   Null hypothesis H0 vs. alternative hypothesis H1
   Needs a test variable X (derived from X1,…,Xn, H0, H1) and a test region R with
      X ∈ R for rejecting H0 and
      X ∉ R for retaining H0

            | Retain H0      | Reject H0
  H0 true   | √ (correct)    | Type I error
  H1 true   | Type II error  | √ (correct)

Page 30

Hypothesis and p-values

¤  Hypotheses
   ¤  A hypothesis of the form θ = θ0 is called a simple hypothesis
   ¤  A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis

¤  Tests
   ¤  A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test
   ¤  A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0, or H0: θ ≥ θ0 vs. H1: θ < θ0, is called a one-sided test

¤  P-value
   ¤  A small p-value means strong evidence against H0

Page 31

Summary

¤  The Recipe for MLE

¤  Understand MLE estimation of Gaussian parameters

¤  Understand “biased estimator” versus “unbiased estimator”

¤  Understand Confidence Intervals

Page 32

MLE Exercise

¤  A representative from the National Football Organization randomly selects people on a random street in Rome until he finds a person who attended the last home football game. Let Θ be the probability that he succeeds in finding such a person on any given selection, and let X denote the number of people he selects until he finds his first success.

¤  How can we model this trial?

¤  What is the likelihood function given a sample X = (x1, x2, …, xR)?

¤  Estimate the parameter Θ

$$f(X = x; \theta) = \theta \times (1-\theta)^{x-1}$$
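As a hedged numerical sketch (ours; it previews, rather than replaces, the analytic derivation the exercise asks for), one can maximize the geometric log-likelihood on simulated data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta_true = 0.3
x = rng.geometric(theta_true, size=500)  # people selected until first success

# Negative log-likelihood of f(x; theta) = theta * (1 - theta)^(x - 1)
def neg_ll(theta):
    return -(len(x) * np.log(theta) + np.sum(x - 1) * np.log(1 - theta))

result = minimize_scalar(neg_ll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)      # numerical MLE of theta
print(1 / x.mean())  # the numerical optimum matches 1 / (sample mean)
```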