Page 1:

CS546: Machine Learning and Natural Language

Probabilistic Classification

Feb 24, 26 2009

Page 2:

Model the problem of text correction as that of generating correct sentences.

Goal: learn a model of the language; use it to predict.

Paradigm: Learn a probability distribution over all sentences.

Use it to estimate which sentence is more likely, e.g. Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park).

[In the same paradigm we sometimes learn a conditional probability distribution.]

In practice: make assumptions on the distribution’s type

In practice: a decision policy depends on the assumptions

2: Generative Model

Page 3:

Consider a distribution D over the space X × Y; X is the instance space, Y is the set of labels (e.g., ±1).

Given a sample {(x_i, y_i)}_{i=1..m} and a loss function L(h(x), y), find h ∈ H that minimizes

  Σ_{i=1..m} L(h(x_i), y_i)

L can be:
  L(h(x), y) = 1 if h(x) ≠ y, otherwise L(h(x), y) = 0   (0-1 loss)
  L(h(x), y) = (h(x) - y)^2   (L2 loss)
  L(h(x), y) = exp{-y h(x)}   (exponential loss)

Find an algorithm that minimizes average loss; then, we know that things will be okay (as a function of H).

Before: Error Driven Learning / Discriminative Learning
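As a quick illustration of the loss functions above, here is a minimal Python sketch (not from the slides; the sample and the fixed linear hypothesis are made up) that computes the empirical average of the 0-1, squared, and exponential losses:

```python
import math

# A tiny made-up sample: (feature vector, label in {-1, +1}) and a fixed linear hypothesis
sample = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([1.0, 1.0], +1)]
w = [0.5, -0.5]

def h(x):
    # real-valued score of the hypothesis; sign(h(x)) is the predicted label
    return sum(wi * xi for wi, xi in zip(w, x))

def zero_one(x, y):
    return 1.0 if (1 if h(x) >= 0 else -1) != y else 0.0

def l2(x, y):
    return (h(x) - y) ** 2

def exp_loss(x, y):
    return math.exp(-y * h(x))

for name, L in [("0-1", zero_one), ("L2", l2), ("exp", exp_loss)]:
    avg = sum(L(x, y) for x, y in sample) / len(sample)
    print(name, "average loss:", round(avg, 3))
```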

Page 4:

Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.

Define best to be: most probable hypothesis in H

In order to do that, we need to assume a probability distribution over the class H.

In addition, we need to know something about the relation between the data observed and the hypotheses (E.g., a coin problem.)

As we will see, we will be Bayesian about other things, e.g., the parameters of the model

Bayesian Decision Theory

Page 5:

P(h) - the prior probability of a hypothesis h (prior over H)

Reflects background knowledge; before data is observed. If no information - uniform distribution.

P(D) - The probability that this sample of the Data is observed. (No knowledge of the hypothesis)

P(D|h): The probability of observing the sample D, given that the hypothesis h holds

P(h|D): The posterior probability of h. The probability h holds, given that D has been observed.

Basics of Bayesian Learning

Page 6:

P(h|D) increases with P(h) and with P(D|h)

P(h|D) decreases with P(D)

  P(h|D) = P(D|h) P(h) / P(D)

Bayes Theorem

Page 7:

The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one h H, given the observed data.

Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it:

  h_MAP = argmax_{h ∈ H} P(h|D)
        = argmax_{h ∈ H} P(D|h) P(h) / P(D)
        = argmax_{h ∈ H} P(D|h) P(h)

Learning Scenario

Page 8:

We may assume that a priori, hypotheses are equally probable

We get the Maximum Likelihood hypothesis:

Here we just look for the hypothesis that best explains the data

  h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)

  P(h_i) = P(h_j) for all h_i, h_j ∈ H

  h_ML = argmax_{h ∈ H} P(D|h)

Learning Scenario (2)
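As a tiny concrete instance of h_ML (my own toy example, echoing the coin problem mentioned on an earlier slide): the hypotheses are candidate head-probabilities, and we pick the one that maximizes the likelihood of the observed flips.

```python
import math

# Observed coin flips (made-up data): 1 = heads, 0 = tails
flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(theta, data):
    # log P(D | h) for a coin with P(heads) = theta
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# A finite hypothesis space H of candidate biases
H = [i / 100 for i in range(1, 100)]
h_ml = max(H, key=lambda theta: log_likelihood(theta, flips))
print("h_ML =", h_ml)  # equals the empirical frequency 7/10
```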

Page 9:

How should we use the general formalism? What should H be?

H can be a collection of functions. Given the training data, choose an optimal function. Then, given new data, evaluate the selected function on it.

H can be a collection of possible predictions. Given the data, try to directly choose the optimal prediction.

H can be a collection of (conditional) probability distributions.

Could be different! Specific examples we will discuss:

Naive Bayes: a maximum-likelihood-based algorithm; Max Entropy: seemingly, a different selection criterion; Hidden Markov Models

Bayes Optimal Classifier

Page 10:

f: X → V, a finite set of values. Instances x ∈ X can be described as a collection of features:

  x = (x_1, x_2, ..., x_n),  x_i ∈ {0,1}

Given an example, assign it the most probable value in V:

  v_MAP = argmax_{v_j ∈ V} P(v_j | x) = argmax_{v_j ∈ V} P(v_j | x_1, x_2, ..., x_n)

Bayesian Classifier

Page 11:

• f: X → V, a finite set of values
• Instances x ∈ X can be described as a collection of features: x = (x_1, x_2, ..., x_n), x_i ∈ {0,1}
• Given an example, assign it the most probable value in V:

  v_MAP = argmax_{v_j ∈ V} P(v_j | x) = argmax_{v_j ∈ V} P(v_j | x_1, x_2, ..., x_n)

• Bayes Rule:

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j) / P(x_1, x_2, ..., x_n)
        = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

• Notational convention: P(y) means P(Y = y)

Bayesian Classifier

Page 12:

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

• Given training data we can estimate the two terms.
• Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data.
• However, it is not feasible to estimate P(x_1, x_2, ..., x_n | v_j). In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

Bayesian Classifier

Page 13:

• Assumption: feature values are independent given the target value

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

  P(x_1, x_2, ..., x_n | v_j)
    = P(x_1 | x_2, ..., x_n, v_j) P(x_2, ..., x_n | v_j)
    = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) P(x_3, ..., x_n | v_j)
    = ...
    = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) ... P(x_n | v_j)
    = ∏_{i=1..n} P(x_i | v_j)   (by the conditional independence assumption)

Naive Bayes

Page 14:

• Assumption: feature values are independent given the target value

• Generative model:
  • First choose a value v_j ∈ V according to P(v_j)
  • Then, given v_j, choose x_1, x_2, ..., x_n according to P(x_k | v_j)

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = ∏_{i=1..n} P(x_i = b_i | v = v_j)

Naive Bayes
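The generative story on this slide can be written as a short sampling routine. A minimal sketch with made-up priors and per-class feature probabilities (none of the numbers come from the slides): first draw the class v according to P(v), then draw each binary feature independently given v.

```python
import random

# Made-up model parameters for a two-class problem with three binary features
prior = {"yes": 0.6, "no": 0.4}                     # P(v)
likelihood = {                                      # P(x_i = 1 | v)
    "yes": [0.8, 0.3, 0.5],
    "no":  [0.2, 0.7, 0.5],
}

def sample_example():
    # First choose a value v according to P(v)
    v = random.choices(list(prior), weights=prior.values())[0]
    # Then choose each x_i independently according to P(x_i | v)
    x = [1 if random.random() < p else 0 for p in likelihood[v]]
    return v, x

random.seed(0)
for _ in range(3):
    print(sample_example())
```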

Page 15:

• Assumption: feature values are independent given the target value

• Learning method: Estimate n|V| parameters and use them to compute the new value. (how to estimate?)

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = ∏_{i=1..n} P(x_i = b_i | v = v_j)

Naive Bayes

Page 16:

• Assumption: feature values are independent given the target value

• Learning method: Estimate n|V| parameters and use them to compute the new value.
• This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).

• This is learning without trying to achieve consistency or even approximate consistency.

• Why does it work?

  v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

  P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = ∏_{i=1..n} P(x_i = b_i | v = v_j)

Naive Bayes

Page 17:

• Notice that the feature values are assumed to be conditionally independent given the target value; they are not required to be independent.
• Example: f(x,y) = xy over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2. The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) (interpretation: for every value of x and y).
• But, given that f(x,y) = 0:

  p(x=1 | f=0) = p(y=1 | f=0) = 1/3,   p(x=1, y=1 | f=0) = 0

so x and y are not conditionally independent.

Conditional Independence
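The counterexample above can be checked by brute-force enumeration. A minimal sketch (the enumeration code is mine; the numbers are the ones on the slide):

```python
from itertools import product

# Uniform product distribution over (x, y) in {0,1}^2; f(x, y) = x*y
points = [(x, y, x * y) for x, y in product([0, 1], repeat=2)]

def prob(pred, given=lambda p: True):
    # P(pred | given) under the uniform distribution over the four points
    cond = [p for p in points if given(p)]
    return sum(1 for p in cond if pred(p)) / len(cond)

# x and y are independent...
assert prob(lambda p: p[0] == 1 and p[1] == 1) == prob(lambda p: p[0] == 1) * prob(lambda p: p[1] == 1)

# ...but not conditionally independent given f = 0
px = prob(lambda p: p[0] == 1, given=lambda p: p[2] == 0)                   # 1/3
py = prob(lambda p: p[1] == 1, given=lambda p: p[2] == 0)                   # 1/3
pxy = prob(lambda p: p[0] == 1 and p[1] == 1, given=lambda p: p[2] == 0)    # 0
print(px, py, pxy)
assert pxy != px * py
```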

Page 18:

• The other direction also does not hold: x and y can be conditionally independent but not independent.
  f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
  f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
  and assume, say, that p(f=0) = p(f=1) = 1/2. Given the value of f, x and y are independent.
• What about unconditional independence?

Conditional Independence

Page 19:

• The other direction also does not hold: x and y can be conditionally independent but not independent.
  f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
  f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
  and assume, say, that p(f=0) = p(f=1) = 1/2. Given the value of f, x and y are independent.
• What about unconditional independence?
  p(x=1) = p(x=1 | f=0) p(f=0) + p(x=1 | f=1) p(f=1) = 0.5 + 0 = 0.5
  p(y=1) = p(y=1 | f=0) p(f=0) + p(y=1 | f=1) p(f=1) = 0 + 0.5 = 0.5
  But p(x=1, y=1) = p(x=1, y=1 | f=0) p(f=0) + p(x=1, y=1 | f=1) p(f=1) = 0

so x and y are not independent.

Conditional Independence

Page 20:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
 1   Sunny     Hot          High      Weak    No
 2   Sunny     Hot          High      Strong  No
 3   Overcast  Hot          High      Weak    Yes
 4   Rain      Mild         High      Weak    Yes
 5   Rain      Cool         Normal    Weak    Yes
 6   Rain      Cool         Normal    Strong  No
 7   Overcast  Cool         Normal    Strong  Yes
 8   Sunny     Mild         High      Weak    No
 9   Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example

Page 21:

• How do we estimate P(observation | v) ?

  v_NB = argmax_{v ∈ {yes,no}} P(v) ∏_i P(x_i = observation_i | v)

Estimating Probabilities

Page 22:

• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)   (6 numbers)
• Compute P(Temp = h/mild/cool | PlayTennis = yes/no)   (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)   (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)   (4 numbers)

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example

Page 23:

• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)   (6 numbers)
• Compute P(Temp = h/mild/cool | PlayTennis = yes/no)   (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)   (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)   (4 numbers)

• Given a new instance: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• Predict: PlayTennis = ?

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example

Page 24:

• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

  P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36

  P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
  P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
  P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
  P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

  P(yes | …) ~ 0.0053                      P(no | …) ~ 0.0206

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example
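The numbers on this slide can be reproduced in a few lines. A minimal sketch (my own implementation of the counting and the product, using the PlayTennis table from the earlier slide) that estimates all probabilities by relative frequency and scores the new instance:

```python
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows from the table
data = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

labels = Counter(row[-1] for row in data)        # class counts
feature_counts = defaultdict(Counter)            # counts of (feature index, label) -> value
for row in data:
    for i, value in enumerate(row[:-1]):
        feature_counts[(i, row[-1])][value] += 1

def score(instance, label):
    # P(v) * prod_i P(x_i | v), with probabilities estimated by relative frequency
    s = labels[label] / len(data)
    for i, value in enumerate(instance):
        s *= feature_counts[(i, label)][value] / labels[label]
    return s

new = ("Sunny", "Cool", "High", "Strong")
for v in ("Yes", "No"):
    print(v, round(score(new, v), 4))   # ~0.0053 for Yes, ~0.0206 for No
```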

Page 25:

• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

  P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36

  P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
  P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
  P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
  P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

  P(yes | …) ~ 0.0053                      P(no | …) ~ 0.0206

What if we were asked about Outlook = OC (overcast)?

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example

Page 26:

• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

  P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36

  P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
  P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
  P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
  P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

  P(yes | …) ~ 0.0053                      P(no | …) ~ 0.0206
  P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Example

Page 27:

• Notice that the naïve Bayes method seems to be giving a method for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)  /  [ P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0) ]  >  1

Naive Bayes: Two Classes

Page 28:

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)  /  [ P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0) ]  >  1

Denote: p_i = p(x_i = 1 | v = 1),  q_i = p(x_i = 1 | v = 0). Then:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}  /  [ P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i} ]  >  1

Naive Bayes: Two Classes

Page 29:

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}  /  [ P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i} ]

  = P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}  /  [ P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i} ]  >  1

Naive Bayes: Two Classes

Page 30:

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}  /  [ P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i} ]

  = P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}  /  [ P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i} ]  >  1

Take the logarithm; we predict v = 1 iff:

  log [ P(v_j = 1) / P(v_j = 0) ]  +  Σ_i log [ (1 - p_i) / (1 - q_i) ]  +  Σ_i ( log [ p_i / (1 - p_i) ] - log [ q_i / (1 - q_i) ] ) x_i  >  0

Naïve Bayes: Two Classes

Page 31:

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}  /  [ P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i} ]

  = P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}  /  [ P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i} ]  >  1

Take the logarithm; we predict v = 1 iff:

  log [ P(v_j = 1) / P(v_j = 0) ]  +  Σ_i log [ (1 - p_i) / (1 - q_i) ]  +  Σ_i ( log [ p_i / (1 - p_i) ] - log [ q_i / (1 - q_i) ] ) x_i  >  0

• We get that the optimal Bayes behavior is given by a linear separator with

  w_i = log [ p_i / (1 - p_i) ] - log [ q_i / (1 - q_i) ] = log [ p_i (1 - q_i) / (q_i (1 - p_i)) ]

  If p_i = q_i then w_i = 0 and the feature is irrelevant.

Naïve Bayes: Two Classes

Page 32:

• We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied.

• The linear form of the classifiers provides some hints.

Why does it work?

Page 33:

• In the case of two classes we have that:

  log [ P(v = 1 | x) / P(v = 0 | x) ]  =  Σ_i w_i x_i + b

Naïve Bayes: Two Classes

Page 34:

• In the case of two classes we have that:

  log [ P(v = 1 | x) / P(v = 0 | x) ]  =  Σ_i w_i x_i + b      (1)

• but since

  P(v = 1 | x) = 1 - P(v = 0 | x)      (2)

• We get (plug (2) into (1); some algebra):

  P(v = 1 | x) = 1 / (1 + exp(-Σ_i w_i x_i - b))

• Which is simply the logistic (sigmoid) function used in the neural network representation.

Naïve Bayes: Two Classes
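Tying the last few slides together, here is a minimal sketch (the p_i, q_i and prior values are made up; the formulas are the ones derived above) showing that the naïve Bayes log-odds is a linear function of x and that the sigmoid of that linear score recovers P(v = 1 | x):

```python
import math

# Made-up two-class naive Bayes parameters
p = [0.9, 0.2, 0.6]          # p_i = P(x_i = 1 | v = 1)
q = [0.3, 0.7, 0.6]          # q_i = P(x_i = 1 | v = 0)
prior1, prior0 = 0.5, 0.5    # P(v = 1), P(v = 0)

# Linear separator derived from naive Bayes
w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
b = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
print("weights:", w)   # w[2] is 0 because p[2] == q[2]: that feature is irrelevant

def posterior_v1(x):
    # Sigmoid of the linear score gives P(v = 1 | x)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# Cross-check against the direct naive Bayes computation
def nb_joint(x, probs, prior):
    lik = 1.0
    for xi, pi in zip(x, probs):
        lik *= pi if xi == 1 else (1 - pi)
    return prior * lik

x = [1, 0, 1]
direct = nb_joint(x, p, prior1) / (nb_joint(x, p, prior1) + nb_joint(x, q, prior0))
print(posterior_v1(x), direct)   # the two numbers agree
```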

Page 35:

Another look at Naive Bayes

[Features: x_1, x_2, x_3, ..., x_n, and conjunctions x_2x_3, x_3x_4, ..., x_{n-1}x_n]

[Graphical model: the label l with Pr(l) at the root; features 1, 2, 3, ..., n as its children, each with Pr(x_i | l). It encodes the NB independence assumption in the edge structure (siblings are independent given their parent).]

  c_[i,l] = log P̂_D(x_i = 1 | l) = log ( |{(x,j) : x_i(x) = 1, j = l}| / |{(x,j) : j = l}| )

  c_[0,l] = log P̂_D(l) = log ( |{(x,j) : j = l}| / |S| )

  prediction(x) = argmax_{l ∈ {0,1}} ( c_[0,l] + Σ_{i=1..n} c_[i,l] x_i )

Linear Statistical Queries Model

Page 36:

Hidden Markov Model (HMM)

An HMM is a probabilistic generative model: it models how an observed sequence is generated.

Let's call each position in a sequence a time step. At each time step, there are two variables:
  the current state (hidden)
  the observation

Page 37:

HMM

Elements:
  Initial state probability P(s_1)
  Transition probability P(s_t | s_{t-1})
  Observation probability P(o_t | s_t)

As before, the graphical model is an encoding of the independence assumptions

Consider POS tagging: O – words, S – POS tags

[Graphical model: a chain of hidden states s1 → s2 → s3 → s4 → s5 → s6, each state s_t emitting an observation o_t.]

Page 38:

HMM for Shallow Parsing

States: {B, I, O}

Observations: Actual words and/or part-of-speech tags

[Example sequence over the words "Mr. Brown blamed Mr. Bob for ...": s1=B (Mr.), s2=I (Brown), s3=O (blamed), s4=B (Mr.), s5=I (Bob), s6=O (for)]

Page 39:

HMM for Shallow Parsing

Given a sentence, we can ask what the most likely state sequence is.

Initial state probabilities: P(s_1=B), P(s_1=I), P(s_1=O)

Transition probabilities: P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B), P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), …

Observation probabilities: P(o_t='Mr.' | s_t=B), P(o_t='Brown' | s_t=B), …, P(o_t='Mr.' | s_t=I), P(o_t='Brown' | s_t=I), …

[Same example sequence as before: s1=B (Mr.), s2=I (Brown), s3=O (blamed), s4=B (Mr.), s5=I (Bob), s6=O (for)]

Page 40:

Finding most likely state sequence in HMM (1)

  P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
    = P(o_k | o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1) · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
    = P(o_k | s_k) · P(o_{k-1}, o_{k-2}, ..., o_1, s_k, s_{k-1}, ..., s_1)
    = P(o_k | s_k) · P(s_k | s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
    = P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
    = P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

Page 41:

Finding most likely state sequence in HMM (2)

  argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
    = argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
    = argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
    = argmax_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

Page 42:

Finding most likely state sequence in HMM (3)

  max_{s_k, s_{k-1}, ..., s_1} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}, ..., s_1} [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)      (a function of s_k)

  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}, ..., s_1} [ ∏_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)

Page 43:

Finding most likely state sequence in HMM (4)

Viterbi's Algorithm: Dynamic Programming

  max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ]
    · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ...
    · max_{s_2} [ P(s_3 | s_2) · P(o_2 | s_2) ]
    · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
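The nested maximizations above can be computed left to right with dynamic programming. A minimal Viterbi sketch in Python (my own generic implementation; the tiny B/I/O model at the bottom is made up for illustration, not estimated from any corpus):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observation sequence obs."""
    # V[t][s] = best score of any state sequence ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best previous state to move into s at time t
            prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Made-up shallow-parsing style model over states {B, I, O}
states = ["B", "I", "O"]
start_p = {"B": 0.5, "I": 0.1, "O": 0.4}
trans_p = {"B": {"B": 0.1, "I": 0.7, "O": 0.2},
           "I": {"B": 0.1, "I": 0.4, "O": 0.5},
           "O": {"B": 0.4, "I": 0.1, "O": 0.5}}
emit_p = {"B": {"Mr.": 0.6, "Brown": 0.1, "blamed": 0.0, "Bob": 0.1},
          "I": {"Mr.": 0.1, "Brown": 0.5, "blamed": 0.0, "Bob": 0.4},
          "O": {"Mr.": 0.0, "Brown": 0.0, "blamed": 0.6, "Bob": 0.0}}

print(viterbi(["Mr.", "Brown", "blamed"], states, start_p, trans_p, emit_p))  # ['B', 'I', 'O']
```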

Page 44:

Learning the Model

Estimate:
  Initial state probability P(s_1)
  Transition probability P(s_t | s_{t-1})
  Observation probability P(o_t | s_t)

Unsupervised learning (states are not observed): EM algorithm

Supervised learning (states are observed; more common):
  ML estimate of the above terms directly from data

Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
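For the supervised case, the ML estimates are just normalized counts over the labeled sequences, exactly as in naive Bayes. A minimal sketch (my own code over a made-up tagged corpus) that estimates the three tables:

```python
from collections import Counter

# Made-up labeled sequences: lists of (observation, state) pairs
corpus = [
    [("Mr.", "B"), ("Brown", "I"), ("blamed", "O")],
    [("Mr.", "B"), ("Bob", "I"), ("blamed", "O")],
]

start, trans, emit = Counter(), Counter(), Counter()
state_count, prev_count = Counter(), Counter()
for seq in corpus:
    start[seq[0][1]] += 1
    for i, (obs, state) in enumerate(seq):
        emit[(state, obs)] += 1
        state_count[state] += 1
        if i > 0:
            trans[(seq[i - 1][1], state)] += 1
            prev_count[seq[i - 1][1]] += 1

# ML estimates: relative frequencies
P_start = {s: c / len(corpus) for s, c in start.items()}
P_trans = {(a, b): c / prev_count[a] for (a, b), c in trans.items()}
P_emit = {(s, o): c / state_count[s] for (s, o), c in emit.items()}
print(P_start)
print(P_trans)
print(P_emit)
```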

Page 45:

Another view of Markov Models

Assumptions:
  Pr(w_i | x) = Pr(w_i | t_i)
  Pr(t_{i+1} | t_i, t_{i-1}, ..., t_1) = Pr(t_{i+1} | t_i)

Prediction: predict the t ∈ T that maximizes

  Pr(t | t_{i-1}) · Pr(w_i | t) · Pr(t_{i+1} | t)

Input: x = (w_1 : t_1, ..., w_{i-1} : t_{i-1}, w_i : ?, w_{i+1} : t_{i+1}, ...)

[Diagram: states t_{i-1}, t_i, t_{i+1} ∈ T; observations w_{i-1}, w_i, w_{i+1} ∈ W]

Page 46:

Another View of Markov Models

As for NB: features are pairs and singletons of t‘s, w’s

  c_[t_1, t_2] = log P̂_D(t_2 | t_1) = log ( |{(t, t') : t = t_1, t' = t_2}| / |{(t, t') : t = t_1}| )

  prediction(x) = argmax_{t ∈ T} Σ_i c_[x_i, t]

  Only 3 active features: c_[t_{i-1}, t], c_[w_i, t], and c_[t, t_{i+1}]; otherwise c = 0.

Input: x = (w_1 : t_1, ..., w_{i-1} : t_{i-1}, w_i : ?, w_{i+1} : t_{i+1}, ...)

[Diagram: states t_{i-1}, t_i, t_{i+1} ∈ T; observations w_{i-1}, w_i, w_{i+1} ∈ W]

This can be extended to an argmax that maximizes the prediction of the whole state sequence and computed, as before, via Viterbi.

Page 47:

Learning with Probabilistic Classifiers

Learning Theory

We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries Models.

The low expressivity explains generalization + robustness. Is that all?

It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?

  Err_S(h) = |{x ∈ S : h(x) ≠ l}| / |S|

In general, no. (Unless it corresponds to some probabilistic assumptions that hold.)

Page 48:

Learning Protocol

LSQ hypotheses are computed directly, w/o assumptions on the underlying distribution:

- Choose features - Compute coefficients

Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample?

In general, no. (Unless it corresponds to some probabilistic assumptions that hold.)

Page 49:

Learning Protocol: Practice

LSQ hypotheses are computed directly: - Choose features - Compute coefficients

If the hypothesis does not fit the training data: augment the set of features

(Forget your original assumption)

Page 50:

Example: probabilistic classifiers

Features are pairs and singletons of t's, w's. Additional features are included (e.g., conjunctions such as x_2x_3, x_3x_4, ..., x_{n-1}x_n).

[NB graphical model: label l with Pr(l); features 1, 2, 3, ..., n with Pr(x_i | l)]

[Markov model diagram: states t_{i-1}, t_i, t_{i+1} ∈ T; observations w_{i-1}, w_i, w_{i+1} ∈ W]

If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).

Page 51:

Why is it relatively easy to fit the data? Consider all distributions with the same marginals Pr(x_i | l).

(E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.)

(Garg & Roth, ECML'01): In most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution).

Robustness of Probabilistic Predictors

Page 52:

Summary: Probabilistic Modeling

Classifiers derived from probability density estimation models were viewed as LSQ hypotheses.

Probabilistic assumptions:
  + guide feature selection, but also
  - do not allow the use of more general features.

  c_i(x) = c_i · x_{i_1} x_{i_2} ... x_{i_k}

Page 53:

A Unified Approach

Most methods blow up original feature space.

And make predictions using a linear representation over the new feature space

  X = (x_1, x_2, x_3, ..., x_n)  →  (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_k(x)),   k >> n

  prediction = argmax_j Σ_i c_{ij} Φ_i(x)

Note: methods do not have to actually do that; but they produce the same decision as a hypothesis that does. (Roth 98; 99; 00)

Page 54:

A Unified Approach

Most methods blow up original feature space.

And make predictions using a linear representation over the new feature space

  X = (x_1, x_2, x_3, ..., x_n)  →  (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_k(x)),   k >> n

  prediction = argmax_j Σ_i c_{ij} Φ_i(x)

  Probabilistic methods
  Rule-based methods (TBL; decision lists; exponentially decreasing weights)
  Linear representations (SNoW; Perceptron; SVM; Boosting)
  Memory-based methods (subset features)

Page 55:

A Unified Approach

Most methods blow up original feature space.

And make predictions using a linear representation over the new feature space

  X = (x_1, x_2, x_3, ..., x_n)  →  (Φ_1(x), Φ_2(x), Φ_3(x), ..., Φ_k(x)),   k >> n

  prediction = argmax_j Σ_i c_{ij} Φ_i(x)

Q1: How are the weights determined?
Q2: How is the new feature space determined? Implications? Restrictions?

Page 56:

What’s Next?

(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  Yes.

(2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)?
  Yes. Single classification: Logistic regression / Max Entropy. HMM.

(3) General sequential processes. First, multi-class classification (what's the relation?)

Page 57:

Conditional models: Introduction

Starting from the early 2000s, conditional and discriminative models became very popular in the NLP community. Why?

  They often give better accuracy than generative models.
  It is often easier to integrate complex features in these models.
  Generalization results suggest these methods.

Does it make sense to distinguish probabilistic conditional models from discriminative classifiers? (I.e., models which output conditional probabilities vs. methods learned to minimize the expected error.)

  Usually not (recall Zhang's paper presented by Sarah). They can often be regarded as both (either optimizing a bound on error or estimating the parameters of some conditional model).

Page 58:

Joint vs Conditional (Probabilistic View)

Joint (or generative) models model the joint probability P(x,y) of both the input and the output.

Conditional models model P(y | x): there is no need to make assumptions on the form of P(x).

Joint examples: n-gram language models, Naive Bayes, HMMs, Probabilistic Context Free Grammars (PCFGs) in parsing.

Conditional examples: logistic regression, MaxEnt, perceptron, SNoW, etc.

Many of the following slides are based on / taken from the Klein & Manning tutorial at ACL 03.

Page 59:

Joint vs Conditional (Probabilistic View)

Bayesian networks for conditional and generative models; recall:

  circles (nodes) correspond to variables
  each node is dependent on its parents in the graph
  a table of probabilities corresponds to each node in the graph
  you can think of each node as a classifier which predicts its value given the values of its parents

Page 60:

Example: Word Sense Disambiguation

We can use the same features (i.e. edges in the graphical model are the same but direction may be different) and get very different results

Word Sense Disambiguation (supervised): the same features (i.e. the same class of functions) but different estimation method (Klein & Manning 02):

Page 61:

Joint vs. Conditional

However, this is not always the case; it depends on:

  how wrong the assumptions in the generative model are
  the amount of training data (we plan to discuss the Ng & Jordan 03 paper about it)

Other reasons why we may want to use generative models:

  it is sometimes easier to integrate non-labeled data in a generative model (semi-supervised learning)
  it is often easier to estimate parameters (to train)
  ...

Page 62:

Maximum Entropy Models

Known by different names: MaxEnt, CRF (local vs. global normalization), Logistic Regression, Log-linear models

Plan:
  Recall feature-based models
  Max-entropy motivation for the model
  ML motivation (logistic regression)
  Learning (parameter estimation)
  Bound minimization view

Page 63:

Features: Predicates on variables

Page 64:

Example Tasks

Page 65:

Conditional vs. Joint

Joint: given data {(c, d)} we want to construct a model P(c, d).

  Maximize likelihood. In the simplest case we just compute counts, normalize, and we have a model.
  In fact it is not so simple: smoothing, etc.
  Not always trivial counts; what if you believe that there is some unknown variable h: P(c, d) = Σ_h P(h, c, d)?
  Not trivial counts any more...

Conditional:

  Maximize conditional likelihood (probabilistic view).
  Harder to do.
  Closer to minimization of error.

Page 66:

Feature-Based Classifier

Page 67:

Feature-Based Classifier

Page 68:

Maximizing Conditional Likelihood

Given the model and the data, find the parameters which maximize the conditional likelihood L(λ) = ∏_i P(c_i | d_i, λ).

For the log-linear case we consider, it means finding λ to maximize:
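The objective itself was on a slide image that did not survive conversion. As a hedged reconstruction of the standard conditional log-likelihood for a log-linear model, L(λ) = Σ_i [ λ·f(c_i, d_i) − log Σ_c exp(λ·f(c, d_i)) ], here is a minimal sketch (my own code, made-up data and feature templates) that maximizes it by gradient ascent, using the standard gradient: empirical feature counts minus the model's expected feature counts.

```python
import math

# Made-up example: classes and inputs d given as dicts of feature counts
classes = ["yes", "no"]
data = [({"w1": 1, "w2": 0}, "yes"), ({"w1": 0, "w2": 1}, "no"), ({"w1": 1, "w2": 1}, "yes")]

def features(c, d):
    # Conjoin each input feature with the class label (typical MaxEnt feature templates)
    return {(name, c): value for name, value in d.items()}

def scores(d, lam):
    return {c: sum(lam.get(k, 0.0) * v for k, v in features(c, d).items()) for c in classes}

def cond_prob(d, lam):
    s = scores(d, lam)
    z = sum(math.exp(v) for v in s.values())
    return {c: math.exp(v) / z for c, v in s.items()}

lam = {}
for step in range(100):
    grad = {}
    for d, c_true in data:
        probs = cond_prob(d, lam)
        for c in classes:
            for k, v in features(c, d).items():
                # empirical counts minus expected counts under the model
                grad[k] = grad.get(k, 0.0) + ((1.0 if c == c_true else 0.0) - probs[c]) * v
    for k, g in grad.items():
        lam[k] = lam.get(k, 0.0) + 0.5 * g   # gradient ascent step

loglik = sum(math.log(cond_prob(d, lam)[c]) for d, c in data)
print("conditional log-likelihood:", round(loglik, 3))
```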

Page 69:

MaxEnt model

Page 70:

Learning

Page 71:

Learning

Page 72:

Learning

Page 73:

Learning

Page 74:

Relation to Max Ent Principle


Recall we discussed the MaxEnt principle before: we know the marginals and now we want to fill in the table.

We argued that we want as uniform a distribution as possible, but one which has the same feature expectations as we observed.

So, we want the MaxEnt distribution.

Page 75:

Relation to Max Ent Principle


It is possible to show (see additional notes on the lectures page) that the MaxEnt distribution has log-linear form

And we already observed that (conditional) ML training corresponds to finding parameters such that the empirical distribution of the features matches the distribution given by the model.

So, they are the same!

Important: a log-linear model trained to maximize likelihood is equivalent to a MaxEnt model with constraints on the expectations of the same set of features.

Page 76:

Relation to Max Ent Principle


Page 77:


Relation to Max Ent Principle

Page 78:


Examples

Page 79:


Examples

Page 80:

Properties of MaxEnt

Unlike Naive Bayes there is no double counting:

Even a frequent feature may have a small weight if it co-occurs with some other feature(s).

What if our counts are not reliable? Our prior belief can be that the function is not spiky (the weights are small):

  min_λ f(λ) = ||λ||² / 2 − log P(C | D, λ)
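Relating this to the earlier gradient-ascent sketch: the ||λ||²/2 term just adds −λ_k to each component of the gradient of the maximization objective. A minimal sketch of one regularized update step (my own code; the sigma2 prior-variance parameter is an addition of mine, with sigma2 = 1 corresponding to the slide's objective):

```python
def regularized_update(lam, grad, lr=0.5, sigma2=1.0):
    """One gradient-ascent step on log P(C | D, lam) - ||lam||^2 / (2 * sigma2).

    `grad` holds the gradient of the unregularized conditional log-likelihood
    (empirical minus expected feature counts, as in the earlier sketch).
    """
    for k, g in grad.items():
        lam[k] = lam.get(k, 0.0) + lr * (g - lam.get(k, 0.0) / sigma2)
    return lam

# Example with made-up numbers: a single update step
print(regularized_update({"f1": 2.0}, {"f1": 0.4, "f2": -0.1}))
```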

Page 81:

MaxEnt vs. Naive Bayes

Page 82:

MaxEnt vs. Naive Bayes

Page 83:

Summary

Today we reviewed the max-ent framework
Considered both the ML and the MaxEnt motivation
Discussed HMMs (you can imagine conditional models instead of HMMs in a similar way)

Next time we will probably start talking about the term project

You may want to start reading about Semantic Role Labeling and Dependency Parsing