
Bayes Rule

• The product rule gives us two ways to factor a joint probability:

P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Therefore,

P(A | B) = P(B | A) P(A) / P(B)

• Why is this useful?
– Can get diagnostic probability P(Cavity | Toothache) from causal probability P(Toothache | Cavity)
– Can update our beliefs based on evidence
– Key tool for probabilistic inference

Rev. Thomas Bayes (1702–1761)

Bayes Rule example

• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on the day of Marie's wedding?

P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)

= P(Predict | Rain) P(Rain) / [P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain)]

= (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986)

= 0.0126 / (0.0126 + 0.0986) ≈ 0.111
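A quick numeric check of this calculation, as a minimal Python sketch (the function and argument names are illustrative, not from the slides):

```python
# Bayes rule with the evidence term expanded by the law of total probability.
def posterior(p_predict_given_rain, p_rain, p_predict_given_no_rain):
    """P(Rain | Predict) via Bayes rule."""
    numerator = p_predict_given_rain * p_rain
    evidence = numerator + p_predict_given_no_rain * (1 - p_rain)
    return numerator / evidence

print(posterior(0.9, 5 / 365, 0.1))  # ~0.111
```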

Bayes rule: Another example

• 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

P(Cancer | Positive) = P(Positive | Cancer) P(Cancer) / P(Positive)

= P(Positive | Cancer) P(Cancer) / [P(Positive | Cancer) P(Cancer) + P(Positive | ¬Cancer) P(¬Cancer)]

= (0.8 × 0.01) / (0.8 × 0.01 + 0.096 × 0.99)

= 0.008 / (0.008 + 0.095) ≈ 0.0776
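The same computation in code, mirroring the previous sketch:

```python
# Posterior probability of cancer given a positive mammography.
p_pos_given_cancer, p_cancer, p_pos_given_no_cancer = 0.8, 0.01, 0.096

numerator = p_pos_given_cancer * p_cancer                      # 0.008
evidence = numerator + p_pos_given_no_cancer * (1 - p_cancer)  # 0.008 + 0.095
print(numerator / evidence)  # ~0.0776
```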

Independence

• Two events A and B are independent if and only if P(A ∧ B) = P(A) P(B)
– In other words, P(A | B) = P(A) and P(B | A) = P(B)
– This is an important simplifying assumption for modeling, e.g., Toothache and Weather can be assumed to be independent

• Are two mutually exclusive events independent?
– No, but for mutually exclusive events we have P(A ∨ B) = P(A) + P(B)

• Conditional independence: A and B are conditionally independent given C iff P(A ∧ B | C) = P(A | C) P(B | C)
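To make the definition concrete, here is a small sketch that checks independence directly from a joint table (all numbers are made up for illustration):

```python
# Toy joint distribution over two binary events A and B, chosen so that
# P(A ∧ B) = P(A) P(B). The numbers are made up.
p_joint = {(True, True): 0.06, (True, False): 0.14,
           (False, True): 0.24, (False, False): 0.56}

p_a = sum(p for (a, _), p in p_joint.items() if a)     # P(A) = 0.2
p_b = sum(p for (_, b), p in p_joint.items() if b)     # P(B) = 0.3
print(abs(p_joint[(True, True)] - p_a * p_b) < 1e-12)  # True: independent
```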

Conditional independence: Example

• Toothache: boolean variable indicating whether the patient has a toothache
• Cavity: boolean variable indicating whether the patient has a cavity
• Catch: whether the dentist's probe catches in the cavity

• If the patient has a cavity, the probability that the probe catches in it doesn't depend on whether he/she has a toothache:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
• Equivalent statement:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Conditional independence: Example

• How many numbers do we need to represent the joint probability table P(Toothache, Cavity, Catch)? 2³ – 1 = 7 independent entries

• Write out the joint distribution using the chain rule:

P(Toothache, Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)

• How many numbers do we need to represent these distributions? 1 + 2 + 2 = 5 independent numbers

• In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
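As a concrete check, this sketch builds the full eight-entry joint table from those 5 numbers using the factorization above (the probability values are made up):

```python
# Build the joint P(Toothache, Catch, Cavity) from 5 parameters.
from itertools import product

p_cavity = 0.2                       # P(Cavity): 1 number
p_catch = {True: 0.9, False: 0.2}    # P(Catch | Cavity): 2 numbers
p_tooth = {True: 0.6, False: 0.1}    # P(Toothache | Cavity): 2 numbers

joint = {}
for toothache, catch, cavity in product([True, False], repeat=3):
    p_c = p_cavity if cavity else 1 - p_cavity
    p_k = p_catch[cavity] if catch else 1 - p_catch[cavity]
    p_t = p_tooth[cavity] if toothache else 1 - p_tooth[cavity]
    joint[(toothache, catch, cavity)] = p_c * p_k * p_t

print(sum(joint.values()))  # 1.0: a valid joint distribution
```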

Probabilistic inference

• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
– Partially observable, stochastic, episodic environment
– Examples: X = {spam, not spam}, e = email message
X = {zebra, giraffe, hippo}, e = image features

• Bayes decision theory:
– The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
– The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e)
– This is the Maximum a Posteriori (MAP) decision

MAP decision

• Value x of X that has the highest posterior probability given the evidence E = e:

x* = argmax_x P(X = x | E = e)
   = argmax_x P(E = e | X = x) P(X = x) / P(E = e)
   = argmax_x P(E = e | X = x) P(X = x)

P(x | e) ∝ P(e | x) P(x)
posterior ∝ likelihood × prior

• Maximum likelihood (ML) decision:

x* = argmax_x P(e | x)
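A minimal sketch of both decision rules over a discrete hypothesis set; the prior and likelihood values are made up to show how the two decisions can differ:

```python
# MAP vs. ML decisions over a small hypothesis set. Numbers are made up.
priors = {'zebra': 0.7, 'giraffe': 0.2, 'hippo': 0.1}       # P(X = x)
likelihoods = {'zebra': 0.1, 'giraffe': 0.3, 'hippo': 0.4}  # P(E = e | X = x)

ml_decision = max(likelihoods, key=likelihoods.get)
map_decision = max(priors, key=lambda x: likelihoods[x] * priors[x])

print(ml_decision)   # 'hippo': highest likelihood (0.4)
print(map_decision)  # 'zebra': 0.1 * 0.7 = 0.07 beats 0.06 and 0.04
```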

Naïve Bayes model

• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X

• MAP decision involves estimating

P(X | E1, …, En) ∝ P(E1, …, En | X) P(X)

– If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

P(E1, …, En | X) = ∏i=1..n P(Ei | X)

– If each feature can take on k values, what is the complexity of storing the resulting distributions?
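The answers, as one line of arithmetic each (a sketch with arbitrary example values for n and k): the full conditional joint table needs k^n entries per hypothesis value, while the factored model needs only n·k.

```python
# Table sizes for the full joint vs. the naive Bayes factorization,
# per hypothesis value x. Example numbers: n = 10 features, k = 5 values.
n, k = 10, 5
print(k ** n)  # 9765625 entries for P(E1, ..., En | X = x)
print(n * k)   # 50 entries for the factored form: n tables of size k
```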

Naïve Bayes Spam Filter

• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)

Naïve Bayes Spam Filter

• We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
• The message is a sequence of words (w1, …, wn)
• Bag of words representation
– The order of the words in the message is not important
– Each word is conditionally independent of the others given the message class (spam or not spam):

P(message | spam) = P(w1, …, wn | spam) = ∏i=1..n P(wi | spam)

• Our filter will classify the message as spam if

P(spam) ∏i=1..n P(wi | spam) > P(¬spam) ∏i=1..n P(wi | ¬spam)
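In an implementation, this product of many small probabilities would underflow, so the comparison is usually done with sums of log-probabilities. A minimal sketch (all word probabilities are made up):

```python
# Spam decision rule using sums of log-probabilities instead of products.
import math

p_spam = 0.33
p_w_spam = {'win': 0.01, 'money': 0.02, 'hello': 0.001}   # P(w | spam)
p_w_ham = {'win': 0.0005, 'money': 0.001, 'hello': 0.01}  # P(w | ¬spam)

message = ['win', 'money']
score_spam = math.log(p_spam) + sum(math.log(p_w_spam[w]) for w in message)
score_ham = math.log(1 - p_spam) + sum(math.log(p_w_ham[w]) for w in message)
print('spam' if score_spam > score_ham else 'not spam')  # spam
```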

Bag of words illustration

US Presidential Speeches Tag Cloud: http://chir.ag/projects/preztags/

Naïve Bayes Spam Filter

P(spam | w1, …, wn) ∝ P(spam) ∏i=1..n P(wi | spam)

posterior ∝ prior × likelihood

Parameter estimation

• In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)
– These are the parameters of the probabilistic model
– How do we obtain the values of these parameters?

[Figure: example parameter tables, showing a prior (spam: 0.33, ¬spam: 0.67) and per-word likelihood tables P(word | spam) and P(word | ¬spam)]

Parameter estimation

• How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
– Empirically: use training data

P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)

– This is the maximum likelihood (ML) estimate, i.e., the estimate that maximizes the likelihood of the training data:

∏d=1..|D| ∏i=1..nd P(wd,i | classd)

(d: index of training document, i: index of a word)

Parameter estimation

• How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
– Empirically: use training data

P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)

• Parameter smoothing: dealing with words that were never seen or seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did

P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V)

(V: total number of unique words)
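A minimal sketch of the smoothed estimate on a made-up two-message spam corpus:

```python
# ML estimate with Laplacian smoothing for P(word | spam) on toy data.
from collections import Counter

spam_messages = [['win', 'money', 'now'], ['win', 'big']]  # made-up corpus
counts = Counter(w for msg in spam_messages for w in msg)
total = sum(counts.values())  # total words in spam messages: 5
vocab_size = 4                # V: assume 4 unique words in the vocabulary

def p_word_given_spam(word):
    return (counts[word] + 1) / (total + vocab_size)

print(p_word_given_spam('win'))     # (2 + 1) / (5 + 4) = 0.333...
print(p_word_given_spam('unseen'))  # (0 + 1) / (5 + 4) = 0.111..., never zero
```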

Summary of model and parameters

• Naïve Bayes model:

P(spam | message) ∝ P(spam) ∏i=1..n P(wi | spam)
P(¬spam | message) ∝ P(¬spam) ∏i=1..n P(wi | ¬spam)

• Model parameters:

Prior: P(spam), P(¬spam)
Likelihood of spam: P(w1 | spam), P(w2 | spam), …, P(wn | spam)
Likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)


Bag-of-word models for images

Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

1. Extract image features
2. Learn “visual vocabulary”
3. Map image features to visual words
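A hedged sketch of this three-step pipeline using k-means from scikit-learn for the vocabulary; the descriptor arrays here are random stand-ins for real local features (e.g., 128-D SIFT descriptors), whose extraction is not shown:

```python
# Bag-of-visual-words sketch: cluster local descriptors into a vocabulary,
# then represent an image as a histogram of visual-word counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
all_descriptors = rng.normal(size=(1000, 128))   # stand-in for step 1 output
image_descriptors = rng.normal(size=(200, 128))  # features from one image

k = 50  # vocabulary size
vocabulary = KMeans(n_clusters=k, n_init=10).fit(all_descriptors)  # step 2
words = vocabulary.predict(image_descriptors)                      # step 3
histogram = np.bincount(words, minlength=k)  # bag-of-words representation
print(histogram.shape)  # (50,)
```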


Bayesian decision making: Summary

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1,e1), …, (xn,en)}