
Page 1: CSA4050: Advanced Topics in NLP


CSA4050: Advanced Topics in NLP

Probability I
Experiments/Outcomes/Events
Independence/Dependence
Bayes’ Rule
Conditional Probability/Chain Rule

Page 2: CSA4050: Advanced Topics in NLP


Acknowledgement

Much of this material is based on material by Mary Dalrymple, King’s College London.

Page 3: CSA4050: Advanced Topics in NLP


Experiment, Basic Outcome, Sample Space

Probability theory is founded upon the notion of an experiment. An experiment is a situation which can have one or more different basic outcomes.
Example: if we throw a die, there are six possible basic outcomes.
The sample space Ω is the set of all possible basic outcomes. For example:
If we toss a coin, Ω = {H,T}
If we toss a coin twice, Ω = {HH,HT,TH,TT}
If we throw a die, Ω = {1,2,3,4,5,6}
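A minimal Python sketch (not part of the original slides) of how such sample spaces can be enumerated mechanically with itertools.product:

```python
from itertools import product

# Basic outcomes of a single coin toss and a single die throw
coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# The sample space of a repeated experiment is the Cartesian product of the
# single-trial outcomes: tossing a coin twice gives 2 * 2 = 4 basic outcomes.
two_tosses = ["".join(p) for p in product(coin, repeat=2)]
print(two_tosses)   # ['HH', 'HT', 'TH', 'TT']
print(die)          # [1, 2, 3, 4, 5, 6]
```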

Page 4: CSA4050: Advanced Topics in NLP


Event

An event A ⊆ Ω is a set of basic outcomes, e.g.
tossing two heads: {HH}
throwing a 6: {6}
getting either a 2 or a 4: {2,4}

Ω itself is the certain event, whilst { } is the impossible event.

Event Space ≠ Sample Space: the event space is the set of all possible events (the power set of Ω), not the set of basic outcomes.

Page 5: CSA4050: Advanced Topics in NLP


Probability distribution

A probability distribution of an experiment is a function that assigns a number (or probability) between 0 and 1 to each basic outcome such that the sum of all the probabilities = 1.

The probability p(E) of an event E is the sum of the probabilities of all the basic outcomes in E.

A uniform distribution is one in which each basic outcome is equally likely.

Page 6: CSA4050: Advanced Topics in NLP


Probability of an Event: die example

Sample space = set of basic outcomes = {1,2,3,4,5,6}
If the die is not loaded, the distribution is uniform, so each basic outcome, e.g. {6} (throwing a six), is assigned the same probability = 1/6.

So p({3,6}) = p({3}) + p({6}) = 2/6 = 1/3
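A minimal Python sketch of the two definitions above, assuming we represent the distribution as a mapping from basic outcomes to probabilities:

```python
from fractions import Fraction

# Uniform distribution over the six faces of a fair (unloaded) die
dist = {face: Fraction(1, 6) for face in range(1, 7)}

def p(event):
    """Probability of an event = sum of the probabilities
    of the basic outcomes it contains."""
    return sum(dist[outcome] for outcome in event)

assert sum(dist.values()) == 1   # a valid distribution sums to 1
print(p({3, 6}))                 # 1/3
```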

Page 7: CSA4050: Advanced Topics in NLP


Estimating Probability

Repeat the experiment T times and count the frequency of E.
Estimated p(E) = count(E)/T
This can be done over m runs, yielding estimates p1(E),...,pm(E).
The best estimate is the (possibly weighted) average of the individual pi(E).
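A sketch of this estimation procedure, assuming the experiment is a single throw of a fair die simulated with random.randint:

```python
import random

def estimate(event, trials=1000):
    """Relative-frequency estimate of p(event) over repeated die throws."""
    hits = sum(random.randint(1, 6) in event for _ in range(trials))
    return hits / trials

# m runs yield m estimates; their average is a better estimate
runs = [estimate({3, 6}) for _ in range(5)]
print(runs, sum(runs) / len(runs))   # values scattered around 1/3
```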

Page 8: CSA4050: Advanced Topics in NLP


Tossing a coin 3 times

Ω = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
E = exactly 2 tails = {HTT,THT,TTH}
Each experiment consists of 1000 cases (3000 tosses).

c1(E) = 386, p1(E) = .386
c2(E) = 375, p2(E) = .375
pmean(E) = (.386 + .375)/2 = .3805

Under a uniform distribution each basic outcome is equally likely, so p(E) = 3/8 = .375.
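The experiment above can be reproduced with a short simulation (the counts will of course differ from run to run):

```python
import random

def run(cases=1000):
    """One experiment: 1000 triples of coin tosses; count exactly two tails."""
    hits = sum([random.choice("HT") for _ in range(3)].count("T") == 2
               for _ in range(cases))
    return hits / cases

estimates = [run(), run()]
print(estimates, sum(estimates) / 2)   # both close to the exact value 3/8
```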

Page 9: CSA4050: Advanced Topics in NLP


Word Probability

General Problem:What is the probability of the next word/character/phoneme in a sequence, given the first N words/characters/phonemes.

To approach this problem we study an experiment whose sample space is the set of possible words.

N.B. The same approach could be used to study the probability of the next character or phoneme.

Page 10: CSA4050: Advanced Topics in NLP


Word Probability

Approximation 1: all words are equally probable.
Then the probability of each word = 1/N, where N is the number of word types.
But all words are not equally probable.
Approximation 2: the probability of each word is its relative frequency of occurrence in a corpus.

Page 11: CSA4050: Advanced Topics in NLP


Word Probability

Estimate p(w), the probability of word w, given a corpus C:
p(w) ≈ count(w)/size(C)

Example (Brown corpus, 1,000,000 tokens):
the: 69,971 tokens, so p(the) ≈ 69,971/1,000,000 ≈ .07
rabbit: 11 tokens, so p(rabbit) ≈ 11/1,000,000 ≈ .00001

Conclusion: the next word is most likely to be the.
Is this correct?
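A sketch of this relative-frequency estimate; the eight-token corpus below is a stand-in for illustration, not the real Brown corpus:

```python
from collections import Counter

# Stand-in corpus; in practice this would be the 1,000,000-token Brown corpus
corpus = "look at the cute rabbit near the house".split()

counts = Counter(corpus)

def p(word):
    """Unigram probability: count(w) / size(C)."""
    return counts[word] / len(corpus)

print(p("the"))      # 0.25
print(p("rabbit"))   # 0.125
```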

Page 12: CSA4050: Advanced Topics in NLP


A counterexample

Given the context: Look at the cute ...
Is the more likely than rabbit?
Context matters in determining what word comes next.
What is the probability of the next word in a sequence, given the first N words?

Page 13: CSA4050: Advanced Topics in NLP


Independent Events

[Diagram: two events, A = eggs and B = Monday, shown as regions within the sample space]

Page 14: CSA4050: Advanced Topics in NLP


Sample Space

(eggs,mon) (cereal,mon) (nothing,mon)

(eggs,tue) (cereal,tue) (nothing,tue)

(eggs,wed) (cereal,wed) (nothing,wed)

(eggs,thu) (cereal,thu) (nothing,thu)

(eggs,fri) (cereal,fri) (nothing,fri)

(eggs,sat) (cereal,sat) (nothing,sat)

(eggs,sun) (cereal,sun) (nothing,sun)

Page 15: CSA4050: Advanced Topics in NLP


Independent Events

Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring.

When two events, A and B, are independent, the probability of both occurring p(A,B) is the product of the prior probabilities of each, i.e.

p(A,B) = p(A) · p(B)
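A small sketch, assuming a uniform distribution over the 21 (breakfast, day) cells of the grid two slides back, checks the product rule exactly:

```python
from fractions import Fraction
from itertools import product

breakfasts = ["eggs", "cereal", "nothing"]
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
omega = list(product(breakfasts, days))    # 21 equally likely outcomes

A = {o for o in omega if o[0] == "eggs"}   # A: eggs for breakfast
B = {o for o in omega if o[1] == "mon"}    # B: it is Monday

def p(event):
    return Fraction(len(event), len(omega))

print(p(A & B) == p(A) * p(B))             # True: A and B are independent
```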

Page 16: CSA4050: Advanced Topics in NLP


Dependent Events

Two events, A and B, are dependent if the occurrence of one affects the probability of the occurrence of the other.

Page 17: CSA4050: Advanced Topics in NLP


Dependent Events

[Diagram: overlapping events A and B within the sample space; their intersection A ∩ B is non-empty]

Page 18: CSA4050: Advanced Topics in NLP


Conditional Probability

The conditional probability of an event A given that event B has already occurred is written p(A|B)

In general p(A|B) ≠ p(B|A)

Page 19: CSA4050: Advanced Topics in NLP


Dependent Events: p(A|B) ≠ p(B|A)

[Diagram: events A and B of different sizes overlap in the sample space, so the intersection A ∩ B occupies different proportions of A and of B]

Page 20: CSA4050: Advanced Topics in NLP


Example Dependencies

Consider the fair die example with
A = outcome divisible by 2
B = outcome divisible by 3
C = outcome divisible by 4

p(A|B) = p(A ∩ B)/p(B) = (1/6)/(1/3) = 1/2
p(A|C) = p(A ∩ C)/p(C) = (1/6)/(1/6) = 1
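Both calculations can be checked with a few lines of Python over the uniform die distribution:

```python
from fractions import Fraction

omega = set(range(1, 7))               # fair die
A = {n for n in omega if n % 2 == 0}   # outcome divisible by 2
B = {n for n in omega if n % 3 == 0}   # outcome divisible by 3
C = {n for n in omega if n % 4 == 0}   # outcome divisible by 4

def p(event):
    return Fraction(len(event), len(omega))

print(p(A & B) / p(B))   # 1/2
print(p(A & C) / p(C))   # 1
```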

Page 21: CSA4050: Advanced Topics in NLP


Conditional Probability

Intuitively, after B has occurred, event A is replaced by A ∩ B, the sample space Ω is replaced by B, and probabilities are renormalised accordingly.

The conditional probability of an event A given that B has occurred (p(B) > 0) is thus given by p(A|B) = p(A ∩ B)/p(B).

If A and B are independent, p(A ∩ B) = p(A) · p(B), so p(A|B) = p(A) · p(B)/p(B) = p(A).

Page 22: CSA4050: Advanced Topics in NLP


Bayesian Inversion

For A and B to occur, either A must occur first, then B, or vice versa. We get the following possibilities:
p(A|B) = p(A ∩ B)/p(B)
p(B|A) = p(A ∩ B)/p(A)

Hence p(A|B) p(B) = p(B|A) p(A).
We can thus express p(A|B) in terms of p(B|A):
p(A|B) = p(B|A) p(A)/p(B)
This equivalence, known as Bayes’ Theorem, is useful when one or the other quantity is difficult to determine.
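Bayes’ Theorem can be verified numerically with the die events from the earlier example (a self-contained sketch):

```python
from fractions import Fraction

omega = set(range(1, 7))
A = {n for n in omega if n % 2 == 0}   # divisible by 2
B = {n for n in omega if n % 3 == 0}   # divisible by 3

def p(event):
    return Fraction(len(event), len(omega))

p_A_given_B = p(A & B) / p(B)             # 1/2, by definition
p_B_given_A = p_A_given_B * p(B) / p(A)   # Bayes' Theorem
print(p_B_given_A == p(A & B) / p(A))     # True: both equal 1/3
```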

Page 23: CSA4050: Advanced Topics in NLP


Bayes’ Theorem

p(B|A) = p(B ∩ A)/p(A) = p(A|B) p(B)/p(A)
The denominator p(A) can be ignored if we are only interested in which event out of some set is most likely.

Typically we are interested in the value of B that is most likely given an observation A, i.e.
arg max_B p(A|B) p(B)/p(A) = arg max_B p(A|B) p(B)
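A sketch of this arg max pattern for the rabbit example; all probabilities below are invented purely for illustration:

```python
# Which word B best explains the observed context A = "Look at the cute ..."?
# Hypothetical likelihoods p(A|B) and priors p(B), invented for illustration.
candidates = {
    "rabbit": {"p_A_given_B": 0.9,      "p_B": 0.00001},
    "the":    {"p_A_given_B": 0.000001, "p_B": 0.07},
}

# p(A) is the same for every candidate, so the arg max can drop it
best = max(candidates,
           key=lambda b: candidates[b]["p_A_given_B"] * candidates[b]["p_B"])
print(best)   # 'rabbit': 0.9 * 1e-5 = 9e-6 beats 1e-6 * 0.07 = 7e-8
```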

Page 24: CSA4050: Advanced Topics in NLP


The Chain Rule

We can extend the definition of conditional probability to more than two events:

p(A1 ∩ ... ∩ An) = p(A1) · p(A2|A1) · p(A3|A1 ∩ A2) · ... · p(An|A1 ∩ ... ∩ An-1)

The chain rule allows us to talk about the probability of sequences of events, p(A1,...,An).
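A last sketch applies the chain rule to a three-word sequence; the conditional probabilities are again invented for illustration:

```python
# p(w1, w2, w3) = p(w1) * p(w2 | w1) * p(w3 | w1, w2)
p_w1 = 0.07            # p("look"); invented
p_w2_given_w1 = 0.2    # p("at" | "look"); invented
p_w3_given_w12 = 0.3   # p("the" | "look at"); invented

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w12
print(p_sequence)      # ≈ 0.0042
```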