Parameter Estimation


Page 1: parameter - UC Santa Barbara

Parameter Estimation

Page 2: parameter - UC Santa Barbara

Notational Conventions

Probabilities
- Mass (discrete) function: capital letters
- Density (continuous) function: small letters

Vector vs. scalar
- Scalar: plain; vector: bold
- 2-D: small letters; higher dimension: capital letters

Notes are in a continuous state of fluctuation until a topic is finished (many updates).

Page 3: parameter - UC Santa Barbara

Parameter Estimation

The optimal classifier maximizes the product of the a priori probability and the class-conditional density:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$

Assumptions: no correlation between samples, time-independent statistics.

Page 4: parameter - UC Santa Barbara

Popular Approaches

- Parametric: assume a certain parametric form for $p(\mathbf{x} \mid \omega_i)$ and estimate the parameters
- Nonparametric: do not assume a parametric form for $p(\mathbf{x} \mid \omega_i)$; estimate the density profile directly
- Boundary: estimate the separating hyperplane (hypersurface) between $p(\mathbf{x} \mid \omega_i)$ and $p(\mathbf{x} \mid \omega_j)$

Page 5: parameter - UC Santa Barbara

A Priori Probability

Given the numbers of occurrences of each class in the training set,

$$(n_1, \omega_1), (n_2, \omega_2), \ldots, (n_k, \omega_k), \qquad M = \sum_{i=1}^{k} n_i,$$

$$P(\omega_i) = \frac{n_i}{M}, \qquad i = 1, \ldots, k$$

This works if the number of samples is large enough and the selection process is not biased. Caveat: sampling may be biased.
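A minimal sketch of this counting estimate (the class labels and helper name below are illustrative, not from the slides):

```python
from collections import Counter

def estimate_priors(y):
    """Estimate P(w_i) = n_i / M by counting class labels."""
    counts = Counter(y)          # n_i for each class
    M = sum(counts.values())     # total number of samples
    return {cls: n / M for cls, n in counts.items()}

# Example: three classes with 50, 30, and 20 samples
y = ["w1"] * 50 + ["w2"] * 30 + ["w3"] * 20
print(estimate_priors(y))        # {'w1': 0.5, 'w2': 0.3, 'w3': 0.2}
```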

Page 6: parameter - UC Santa Barbara

Class-Conditional Density

More complicated than the prior (not a single number, but a distribution):
- assume a certain form
- estimate its parameters

What form should we assume? Many are possible, but in this course we use almost exclusively the Gaussian.

Page 7: parameter - UC Santa Barbara

Gaussian (or Normal)

Scalar case:

$$p(x \mid \omega_i) = N(\mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right]$$

Vector case:

$$p(\mathbf{x} \mid \omega_i) = N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

Unknowns: the class mean and variance ($\mu_i, \sigma_i^2$ in the scalar case; $\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i$ in the vector case).

[Figure: Gaussian distribution]
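A small sketch evaluating these densities (NumPy-based; the function names are illustrative, not from the slides):

```python
import numpy as np

def gaussian_1d(x, mu, var):
    """Scalar Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gaussian_nd(x, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at vector x."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff               # Mahalanobis term
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

print(gaussian_1d(0.0, 0.0, 1.0))                           # ~0.3989
print(gaussian_nd(np.zeros(2), np.zeros(2), np.eye(2)))     # ~0.1592
```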

Page 8: parameter - UC Santa Barbara

[Figure: 1-D Gaussian density $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$ plotted over the feature axis for a population with variance $\sigma^2$]

Page 9: parameter - UC Santa Barbara

Why Gaussian (Normal)?

- The central limit theorem predicts a normal distribution for sums of IID random variables.
- In practice, there are only two numbers to estimate in the scalar case (mean and variance), or $d + d(d+1)/2$ in $d$ dimensions.
- Nice mathematical properties: the Fourier transform of a Gaussian is a Gaussian; products and sums of Gaussians remain Gaussian; any linear transform of a Gaussian is a Gaussian.

Page 10: parameter - UC Santa Barbara

Projection and Transformation

Projections and linear transformations of a Gaussian remain Gaussian. In particular, a whitening transform can diagonalize the covariance matrix.
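A sketch of one common whitening construction, using the eigendecomposition of the sample covariance (the data and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D Gaussian samples
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=5000)

# Whitening transform A = Lambda^{-1/2} Phi^T from the eigendecomposition of the covariance
vals, vecs = np.linalg.eigh(np.cov(X.T))
A = np.diag(1.0 / np.sqrt(vals)) @ vecs.T
X_white = (X - X.mean(axis=0)) @ A.T

# Covariance of the transformed data is (approximately) the identity, hence diagonal
print(np.round(np.cov(X_white.T), 2))
```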

Page 11: parameter - UC Santa Barbara

Parameter Estimation

Maximum likelihood estimator:
- parameters have fixed but unknown values

Bayesian estimator:
- parameters are random variables with known a priori distributions
- the Bayesian estimator allows us to update the a priori distribution by incorporating measurements to sharpen the profile

Page 12: parameter - UC Santa Barbara

Graphically

[Figure: likelihood plotted against the parameters, comparing the MLE (a single point estimate) with the Bayesian view (a distribution over the parameters)]

Page 13: parameter - UC Santa Barbara

Maximum Likelihood Estimator

Given:
- $n$ labeled samples (observations) $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$
- an assumed distribution with $e$ parameters $\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_e\}$
- samples drawn independently from $p(X \mid \omega_j) \equiv p(X \mid \omega_j, \boldsymbol{\theta})$

Find:
- the parameter value $\boldsymbol{\theta}$ that best explains the observations

Page 14: parameter - UC Santa Barbara

MLE Formulation

Maximize

$$p(X \mid \boldsymbol{\theta}) = \prod_{j=1}^{n} p(\mathbf{x}_j \mid \boldsymbol{\theta})$$

or, equivalently, the log likelihood

$$l(\boldsymbol{\theta}) = \log p(X \mid \boldsymbol{\theta}) = \sum_{j=1}^{n} \log p(\mathbf{x}_j \mid \boldsymbol{\theta})$$

by solving

$$\nabla_{\boldsymbol{\theta}}\, p(X \mid \boldsymbol{\theta}) = 0 \quad \text{or} \quad \nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{j=1}^{n} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_j \mid \boldsymbol{\theta}) = 0$$
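When no closed form exists, the same log-likelihood objective can be maximized numerically. A minimal sketch (SciPy-based; the exponential density is an illustrative assumption, not the slides' example):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Samples assumed drawn from an exponential density p(x|theta) = theta * exp(-theta * x)
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)      # true theta = 1/scale = 0.5

def neg_log_likelihood(theta):
    """-l(theta) = -sum_j log p(x_j | theta)."""
    return -np.sum(np.log(theta) - theta * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())                  # numerical MLE vs. the closed form 1/mean
```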

Page 15: parameter - UC Santa Barbara

An Example

Estimate the mean and variance of a 1-D Gaussian. Let $\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)$:

$$p(x_j \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\theta_2}} \exp\!\left[-\frac{(x_j - \theta_1)^2}{2\theta_2}\right]$$

$$\log p(x_j \mid \boldsymbol{\theta}) = -\frac{1}{2}\log(2\pi\theta_2) - \frac{(x_j - \theta_1)^2}{2\theta_2}$$

$$\nabla_{\boldsymbol{\theta}} \log p(x_j \mid \boldsymbol{\theta}) = \begin{pmatrix} \dfrac{x_j - \theta_1}{\theta_2} \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_j - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$$

Page 16: parameter - UC Santa Barbara

An Example (cont.)

Setting the gradient of the log likelihood to zero,

$$\sum_{j=1}^{n} \frac{x_j - \hat{\theta}_1}{\hat{\theta}_2} = 0, \qquad -\sum_{j=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{j=1}^{n} \frac{(x_j - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0$$

gives

$$\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{\mu})^2$$

Class mean as the sample mean; class variance as the sample variance. The discriminant then uses the estimated parameters:

$$g_i(x) = p(x \mid \omega_i)\, P(\omega_i) = N(\hat{\mu}_i, \hat{\sigma}_i^2)\, P(\omega_i) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_i}\, e^{-\frac{(x - \hat{\mu}_i)^2}{2\hat{\sigma}_i^2}}\, P(\omega_i)$$

Note: the unbiased sample variance divides by $n-1$, not $n$, so the MLE of the variance is biased!
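A minimal sketch of these estimates and of the bias correction (the sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=20)   # small sample from N(3, 4)

mu_hat = x.mean()                             # MLE of the mean: sample mean
var_mle = np.mean((x - mu_hat) ** 2)          # MLE of the variance: divides by n (biased)
var_unbiased = np.var(x, ddof=1)              # unbiased estimate: divides by n-1

print(mu_hat, var_mle, var_unbiased)
```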

Page 17: parameter - UC Santa Barbara

[Figure: the likelihood $\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\!\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j - \mu)^2\right]$ evaluated for two candidate widths $\sigma_1$ and $\sigma_2$; feature = coin weight, samples drawn from the population]

Page 18: parameter - UC Santa Barbara

[Figure: the likelihood $\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\!\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j - \hat{\mu})^2\right]$ for candidate widths $\sigma_1$ and $\sigma_2$ around the estimated mean]

If $\sigma$ is too narrow, many sample points fall outside the $2\sigma$ width and have low likelihood of occurrence. If $\sigma$ is too wide, $1/\sigma$ becomes too small and reduces the likelihood of occurrence.

Page 19: parameter - UC Santa Barbara

A Quick Word on MAP

The MAP (maximum a posteriori) estimator is similar to MLE with one additional twist: maximize the (log) likelihood $l(\boldsymbol{\theta})$ plus $\log p(\boldsymbol{\theta})$, the prior probability of the parameter values (if you know it), e.g., the mean is more likely to be $\mu_0$, with a normal distribution around it.

MLE implicitly assumes a uniform prior; MAP does not necessarily. The added term is a case of "regularization".
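Written out, the objective described above is (a standard formulation, stated here for concreteness rather than copied from the slides):

```latex
\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}
  = \arg\max_{\boldsymbol{\theta}} \Bigl[\, l(\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \,\Bigr]
  = \arg\max_{\boldsymbol{\theta}} \Bigl[\, \textstyle\sum_{j=1}^{n} \log p(\mathbf{x}_j \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \,\Bigr]
```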

Page 20: parameter - UC Santa Barbara

Bayesian Estimator

Note that MLE is a batch estimator:
- all data have to be kept
- difficult to update the estimate
- difficult to incorporate other evidence
- insists on a single estimate

The Bayesian estimator:
- allows the parameters themselves to be random variables
- allows multiple sources of evidence
- allows iterative update

Page 21: parameter - UC Santa Barbara

Bayesian Estimator

Based on Bayes rule:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{\sum_j p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}$$

With the sample set $X$ at our disposal:

$$P(\omega_i \mid \mathbf{x}, X) = \frac{p(\mathbf{x} \mid \omega_i, X)\, P(\omega_i \mid X)}{\sum_j p(\mathbf{x} \mid \omega_j, X)\, P(\omega_j \mid X)}$$

Page 22: parameter - UC Santa Barbara

Bayes Rule Formulation

Assume:
- each sample subset $X_i$ comes from only one class $\omega_i$
- the prior $P(\omega_i)$ is independent of $X$

Then

$$P(\omega_i \mid \mathbf{x}, X) = \frac{p(\mathbf{x} \mid \omega_i, X_i)\, P(\omega_i)}{\sum_j p(\mathbf{x} \mid \omega_j, X_j)\, P(\omega_j)}$$

Page 23: parameter - UC Santa Barbara

How can X be used?

The form of the distribution is known (e.g., normal); the parameters are unknown. For estimating: $p(\boldsymbol{\theta} \mid X)$ gives the class parameters, $p(\mathbf{x} \mid \boldsymbol{\theta})$ then constrains $\mathbf{x}$, and putting it all together:

$$p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta}$$

Page 24: parameter - UC Santa Barbara

Bayes Rule Formulation (cont.)

Ideally, if the posterior collapses onto a single value,

$$p(\boldsymbol{\theta} \mid X) = \begin{cases} 1 & \boldsymbol{\theta} = \hat{\boldsymbol{\theta}} \\ 0 & \text{otherwise} \end{cases}$$

then

$$p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta} = p(\mathbf{x} \mid \hat{\boldsymbol{\theta}})$$

This is MLE! Otherwise, all possible $\boldsymbol{\theta}$'s are used.

Page 25: parameter - UC Santa Barbara

Graphic Interpretation

[Figure: the sample set $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ and candidate parameter vectors $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_e\}$, $\boldsymbol{\theta}' = \{\theta_1', \ldots, \theta_e'\}$, $\boldsymbol{\theta}'' = \{\theta_1'', \ldots, \theta_e''\}$, each weighted by $p(\mathbf{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$]

Page 26: parameter - UC Santa Barbara

An example

Estimating the mean of a normal distribution with known variance, using $n$ samples.

First step:

$$p(\mu \mid X) = \frac{p(X \mid \mu)\, p(\mu)}{p(X)}$$

$$p(x_k \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right], \qquad p(\mu) = N(\mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]$$

$$p(x \mid X) = \int p(x \mid \mu)\, p(\mu \mid X)\, d\mu$$

Here $p(x \mid \mu)$ carries the current evidence and $p(\mu \mid X)$ the previous and other evidence. Key to Bayesian estimation: both current and prior evidence can be used.

Page 27: parameter - UC Santa Barbara

Then

$$p(\mu \mid X) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) = \alpha' \exp\!\left[-\frac{1}{2}\left\{\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right\}\right] = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$

with

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, m_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad m_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$

$$\sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2}$$

so $\mu_n \to m_n$ as $n \to \infty$.
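A small sketch of this update for a known class variance (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def posterior_mean_var(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) for the mean of N(mu, sigma^2) given samples x."""
    n = len(x)
    m_n = np.mean(x)                                   # sample mean
    mu_n = (n * sigma0_sq * m_n + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=1.0, size=10)            # sigma^2 = 1 known
print(posterior_mean_var(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
```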

Page 28: parameter - UC Santa Barbara

X helps in:
- defining the mean
- reducing the uncertainty in the mean ($\sigma_n^2$ shrinks as $n$ grows)

Trust the new data ($\mu_n \to m_n$) if:
- the class variance $\sigma^2$ is small
- the number of samples $n$ is large
- the prior is uncertain ($\sigma_0^2$ is large)

Page 29: parameter - UC Santa Barbara

An example (cont.)

Second step:

$$g(x) = p(x \mid X) = \int p(x \mid \mu)\, p(\mu \mid X)\, d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \frac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2} d\mu$$

$$= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n), \quad \text{where } f(\sigma, \sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2}\right)^2\right] d\mu$$

Third step: the predictive density is again Gaussian, $p(x \mid X) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$, and serves as the class-conditional density in the classifier.

Page 30: parameter - UC Santa Barbara

Graphical Interpretation: MLE

[Figure: the likelihood $p(X \mid \mu) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2}$ plotted against $\mu$, with a flat (uniform) $p(\mu)$]

Page 31: parameter - UC Santa Barbara

Graphical Interpretation: Bayesian

[Figure: the likelihood $p(X \mid \mu) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2}$ combined with a non-uniform prior $p(\mu)$ to give the posterior]

$$p(\mu \mid X) = \alpha\, p(X \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2} p(\mu)$$

Page 32: parameter - UC Santa Barbara

Results of Iterative Process

- Start with a prior distribution
- Incorporate the current batch of data
- Generate a new prior
- Goodness of the new prior = goodness of the old prior × goodness of the interpretation

Usually the prior distribution sharpens (Bayesian learning) and the uncertainty drops.

Page 33: parameter - UC Santa Barbara

MLE vs. Bayes

MLE:
- Faster (differentiation)
- Single model
- Known model $p(x \mid \boldsymbol{\theta})$
- Less information

Bayes:
- Slower (integration)
- Multiple weighted models
- Unknown model is fine
- More information (nonuniform prior)

Page 34: parameter - UC Santa Barbara

Does it really make a difference?

Yes. The Bayesian classifier and MAP will in general give different results when used to classify new samples, because MAP (or MLE) keeps only one hypothesis while the Bayesian approach keeps multiple, weighted hypotheses.

Page 35: parameter - UC Santa Barbara

Example

MLE/MAP: pick the single best hypothesis,

$$\boldsymbol{\theta}' = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta} \mid X), \qquad p(\mathbf{x} \mid X) \approx p(\mathbf{x} \mid \boldsymbol{\theta}')$$

Bayesian: weight all hypotheses,

$$p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta}$$

Suppose three hypotheses with posteriors and predictions:

$$P(\theta_1 \mid X) = .4, \quad P(+ \mid \theta_1) = 1, \quad P(- \mid \theta_1) = 0$$
$$P(\theta_2 \mid X) = .3, \quad P(+ \mid \theta_2) = 0, \quad P(- \mid \theta_2) = 1$$
$$P(\theta_3 \mid X) = .3, \quad P(+ \mid \theta_3) = 0, \quad P(- \mid \theta_3) = 1$$

Then

$$p(+ \mid X) = .4 \times 1 + .3 \times 0 + .3 \times 0 = .4, \qquad p(- \mid X) = .4 \times 0 + .3 \times 1 + .3 \times 1 = .6$$

With MLE/MAP only one hypothesis ($\theta_1$) is kept, so the prediction is $+$; the Bayesian answer is $-$.
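A tiny sketch of the same calculation (hypothesis ordering and variable names are illustrative):

```python
import numpy as np

posterior = np.array([0.4, 0.3, 0.3])          # P(theta_i | X)
p_plus = np.array([1.0, 0.0, 0.0])             # P(+ | theta_i)

# MAP / MLE: keep only the single best hypothesis
best = np.argmax(posterior)
print("MAP prediction:", "+" if p_plus[best] > 0.5 else "-")        # +

# Bayesian: weight every hypothesis by its posterior
p_plus_bayes = np.dot(posterior, p_plus)                            # 0.4
print("Bayesian prediction:", "+" if p_plus_bayes > 0.5 else "-")   # -
```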

Page 36: parameter - UC Santa Barbara

Gibbs Sampler

The Bayesian classifier is optimal, but can be very expensive, especially when a large number of hypotheses are kept and evaluated.

Gibbs: randomly pick one hypothesis according to the current posterior distribution $p(\boldsymbol{\theta} \mid X)$.

It can be shown (later) to be related to the k-nearest-neighbor classifier, and its expected error is at most twice that of the Bayesian classifier.

Page 37: parameter - UC Santa Barbara

An Example: Naïve Bayesian

Features are a conjunction of attributes. Bayes' theorem states that the a posteriori probability should be maximized:

$$c = \arg\max_{c_j} P(c_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{c_j} \frac{P(a_1, a_2, \ldots, a_n \mid c_j)\, P(c_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{c_j} P(a_1, a_2, \ldots, a_n \mid c_j)\, P(c_j)$$

The naïve Bayesian classifier assumes independence of the attributes:

$$c_{NB} = \arg\max_{c_j} P(c_j) \prod_i P(a_i \mid c_j)$$

Page 38: parameter - UC Santa Barbara


Example

Day Outlook Temperature Humidity Wind Play tennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

Page 39: parameter - UC Santa Barbara

Example (cont.)

Classify <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>: PlayTennis = yes or no?

$$c_{NB} = \arg\max_{c_j \in \{yes,\, no\}} P(c_j)\, P(Outlook{=}sunny \mid c_j)\, P(Temperature{=}cool \mid c_j)\, P(Humidity{=}high \mid c_j)\, P(Wind{=}strong \mid c_j)$$

From the table:

$$P(PlayTennis{=}yes) = 9/14 = .64, \qquad P(PlayTennis{=}no) = 5/14 = .36$$
$$P(Wind{=}strong \mid yes) = 3/9 = .33, \qquad P(Wind{=}strong \mid no) = 3/5 = .6$$

and so on, giving

$$P(yes)\, P(sunny \mid yes)\, P(cool \mid yes)\, P(high \mid yes)\, P(strong \mid yes) = 0.0053$$
$$P(no)\, P(sunny \mid no)\, P(cool \mid no)\, P(high \mid no)\, P(strong \mid no) = 0.0206$$

so the answer is PlayTennis = no.
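A short sketch reproducing these numbers from the table above (the data rows come from the table; the code structure itself is an illustrative sketch, not the course's implementation):

```python
# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows from the table
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def score(query, label):
    """Naive Bayes score P(c) * prod_i P(a_i | c) using raw frequency counts."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                          # prior P(c)
    for i, value in enumerate(query):
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

query = ("Sunny", "Cool", "High", "Strong")
for label in ("Yes", "No"):
    print(label, round(score(query, label), 4))        # Yes ~0.0053, No ~0.0206
```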

Page 40: parameter - UC Santa Barbara

Caveat

Guard against zero probabilities $P(a_i \mid c_j)$, especially for small sample sizes and large sets of attribute values. Use the m-estimate instead:

$$P(a_i \mid c_j) = \frac{n_{a_i} + m\,p}{n_{c_j} + m}$$

where
- $n_{a_i}$: number of samples in class $c_j$ with attribute value $a_i$
- $n_{c_j}$: number of samples in class $c_j$
- $m$: equivalent sample size (add $m$ more virtual samples)
- $p$: prior estimate; if attribute $a_i$ can take $k$ values, then $p = 1/k$
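A minimal sketch of the m-estimate as a drop-in replacement for the raw frequency ratio (argument names are illustrative):

```python
def m_estimate(n_attr, n_class, m, k):
    """P(a_i | c_j) with m virtual samples and uniform prior p = 1/k."""
    p = 1.0 / k
    return (n_attr + m * p) / (n_class + m)

# Wind=strong in class "no": 3 of 5 samples, Wind takes k=2 values
print(m_estimate(3, 5, m=0, k=2))    # 0.6   (raw frequency)
print(m_estimate(3, 5, m=2, k=2))    # 0.571 (smoothed toward 1/2)
print(m_estimate(0, 5, m=2, k=2))    # 0.143 (no zero probability)
```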

Page 41: parameter - UC Santa Barbara

More Examples

Web page / newsgroup classification:
- like/dislike for web pages
- science/sports/entertainment categories for web pages and newsgroups

Page 42: parameter - UC Santa Barbara

More Examples (cont.)

- Select commonly occurring words as features (appearing at least k times in the documents)
- Eliminate stop words (the, it, etc.) and punctuation
- Apply word stemming (like, liked, etc.)
- Assume $P(word_k \mid class_j)$ is independent of word position in the document
- Achieves 89% accuracy classifying documents from 20 newsgroups