
Instructor’s notes Ch.2 - Supervised Learning

III. Naïve Bayes (pp.70-72)

This is a short section in our text, but we are presenting more material in these notes.

Probability review

Definition of probability: The probability of an event E is the ratio of the nr. of cases where E occurs to the total nr. of cases.

Note well: This applies only when all the cases are equally likely!

Example: We roll a pair of (fair) dice. Find the probability of the event E=”The sum is five”.

□ In the experiment above, calculate the probability of the events A=”The sum is odd”, B=”The sum is prime”.

Union (or) and intersection (and) of events
Cast a die and define these events: A = “Nr. dots is 1” B = “Nr. dots is odd”

Calculate P(A), P(B), P(A∩B), P(A∪B).

□ In the two-dice experiment, calculate the probabilities of these events: A = ”The sum is <10 and second die is >4” B = “The sum is <10 or second die is >4”
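A quick way to check answers like these is to enumerate all 36 equally likely outcomes of the two dice. Here is a minimal Python sketch (the lambdas simply restate the events above; nothing here comes from the textbook's code):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # the 36 equally likely (die1, die2) pairs

def prob(event):
    """Probability of an event = fraction of outcomes where the event holds."""
    return sum(event(d1, d2) for d1, d2 in outcomes) / len(outcomes)

print(prob(lambda d1, d2: d1 + d2 == 5))              # E: the sum is five -> 4/36
print(prob(lambda d1, d2: (d1 + d2) % 2 == 1))        # A: the sum is odd
print(prob(lambda d1, d2: d1 + d2 < 10 and d2 > 4))   # "sum < 10 and second die > 4"
print(prob(lambda d1, d2: d1 + d2 < 10 or d2 > 4))    # "sum < 10 or second die > 4"
```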


Law of conditional probability: P(A|B) = P(A∩B) / P(B), or P(A given B) = P(A and B)/P(B) (*)

Interpretation: Since we know that B occurred, we “renormalize” by dividing by its probability.

Example: Assuming the dots in the figure are equally likely, calculate P(A|B), P(B|A). Do it two ways: from scratch, with the definition of probability as a ratio, and by applying the formula of conditional probability.

□ In the two-dice experiment, calculate the probability of the event C = ”The sum is <10 given that the second die is >4”

Equation (*) can be used to calculate any of the 3 probabilities involved, knowing the other 2. In particular, this form is very useful: P(A∩B) = P(A|B) P(B) = P(B|A) P(A) (**)
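The same enumeration idea verifies the conditional-probability formula (*) directly. A small self-contained sketch for the event C above:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))                    # 36 equally likely pairs

n_b = sum(1 for d1, d2 in outcomes if d2 > 4)                      # B: second die > 4
n_ab = sum(1 for d1, d2 in outcomes if d1 + d2 < 10 and d2 > 4)    # A and B

# "Renormalize" by P(B), as in (*): P(A|B) = P(A∩B) / P(B)
print((n_ab / 36) / (n_b / 36))
# Same result from scratch, counting only within the reduced sample space B:
print(n_ab / n_b)
```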


Bayes’ Theorem: P(A|B) = P(B|A)·P(A) / P(B) (***)

□ Prove Bayes’ Theorem, using Eq. (**)

□ There is a big party on campus where all CS and Business majors are invited. One in ten Business majors are shy and six in ten CS majors are shy. We meet a student who is shy. Is it more likely for their major to be CS or Business? Technically, what is the probability for their major to be CS? (Please ponder for a minute ...)

Hint: We are missing an important piece of information: There are about 100 CS majors and 1000 Business majors in this University!

□ Solve the previous problem by using Bayes’ Theorem (***) directly.

Hint: First define the events A and B!

Solution:

A = The student we meet is a CS major. B = The student we meet is shy.

A|B = ???? B|A = ????

Can we calculate the probabilities of all events on the RHS of Bayes’ Theorem?


P(B|A) = P(shy, given that CS major) = 6/10 = 0.6

P(A) = P(CS major) = 100/(100+1000) = 0.091

P(B) = P(shy) = ???? ... Let us think back on what we did in the first solution:

Where do the “100” and “60” come from?

P(shy) = P(shy and CS) + P(shy and Business)

Now we apply conditional probability (**) to each term:

= P(shy, given CS)·P(CS) + P(shy, given Business)·P(Business) =

= 0.6 x 100 + 0.1 x 1000

Note: Since all probabilities involved have a denominator of 1,100, we only wrote the numerators!

The addition of probabilities performed above is so useful, it was enshrined as another theorem or law of Probability Theory:

Law of Total Probability (LTP): P(B) = P(B|A)·P(A) + P(B|not A)·P(not A)

Note: P(Business) = P(not CS) = P(not A)
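Putting the pieces together, here is the shy-student calculation as a few lines of Python (it just completes the arithmetic above with Bayes' Theorem):

```python
p_cs = 100 / 1100             # P(A): prior probability that the student is a CS major
p_bus = 1000 / 1100           # P(not A): Business major
p_shy_given_cs = 0.6          # P(B|A)
p_shy_given_bus = 0.1         # P(B|not A)

p_shy = p_shy_given_cs * p_cs + p_shy_given_bus * p_bus   # Law of Total Probability
print(p_shy_given_cs * p_cs / p_shy)                      # P(A|B) = 60/160 = 0.375
```

So despite the much higher rate of shyness among CS majors, the shy student we met is still more likely (0.625 vs. 0.375) to be a Business major.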

When using the LTP in the denominator of Bayes’ Theorem, we have this more detailed form of Bayes’ Theorem: P(A|B) = P(B|A)·P(A) / [P(B|A)·P(A) + P(B|not A)·P(not A)] (****)

P(A) is called the prior, and P(A|B) the posterior probability of A. B is called evidence.

The ratio P(B|A)/P(B) is the support B offers A.


□ Two cab companies serve a city: the Green company operates 85% of the cabs and the Blue company operates 15% of the cabs. One of the cabs is involved in a hit-and-run accident at night, and a witness identifies the hit-and-run cab as a Blue cab. When the court tests the reliability of the witness under circumstances similar to those on the night of the accident, (s)he correctly identifies the color of a cab 80% of the time and misidentifies it the other 20% of the time. What is the probability that the cab involved in the accident was Blue, as stated by the witness?

Hint: Use Bayes’ Theorem. Define the events A and B.

For more practice: Wikipedia’s page for Bayes’ Theorem, https://en.wikipedia.org/wiki/Bayes%27_theorem, has three nice examples - study them all:

Drug tests
Reliability of factory machines
Identification of beetles

https://eli.thegreenplace.net/2018/conditional-probability-and-bayes-theorem/


Application: TEXT LEARNING

We have a number of messages written by several authors [1]. For simplicity, let us call the authors A1 and A2. Also for simplicity, the classification will not be based on all the words present, but on a (relatively small) subset of them [2]. In this example, let us consider only three magic words: foo, bar, and baz.

For each author, we calculate the probability of each word to appear in a message:

P(foo|A1) = 0.2 P(bar|A1) = 0.3 P(baz|A1) = 0.4

P(foo|A2) = 0.3 P(bar|A2) = 0.1 P(baz|A2) = 0.3

We also need to know the distribution of the messages between A1 and A2. Ideally, they are evenly distributed: P(A1) = P(A2) = 0.5.

We now have a new message, whose author is unknown, either A1 or A2. We find which of the magic words are present in the message, for example foo and bar. We call {foo ∩ bar} a bag of words, because the order or proximity of the words are not considered. We would like to calculate the probabilities P(A1|foo ∩ bar) and P(A2|foo ∩ bar), because then we would predict that the author with the higher probability is the author of the message.

Mnemonic: The formulas are easy to remember with A for author and B for bag of words: P(A|B).

We apply Bayes’ Theorem (***) to find:

P(A1|foo ∩ bar) = P(foo ∩ bar|A1)P(A1)/P(foo ∩ bar), and similar for A2.

Note: In Bayes’ Theorem we only have one piece of evidence, but here we have two: foo and bar. What to do?

Here is where the naive in Naive Bayes comes into play: We assume that the words occur independently [3], so we can factorize the intersections:

P(foo ∩ bar|A1) = P(foo|A1)P(bar|A1) and similar for A2.

Combining the last two eqns. we have:

P(A1|foo ∩ bar) = P(foo|A1)P(bar|A1)P(A1)/P(foo ∩ bar) and similar for A2.

Since A1 and A2 have the same denominator, we don’t need it in order to establish which probability is greater, so we further simplify the formulas to:

P(A1|foo ∩ bar) ~ P(foo|A1)P(bar|A1)P(A1)

P(A2|foo ∩ bar) ~ P(foo|A2)P(bar|A2)P(A2) (o)
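Using the conditional word probabilities listed at the start of this section, the products in (o) take only a few lines of Python (a sketch of the hand calculation, not the textbook's code):

```python
p_word = {'A1': {'foo': 0.2, 'bar': 0.3, 'baz': 0.4},
          'A2': {'foo': 0.3, 'bar': 0.1, 'baz': 0.3}}
p_author = {'A1': 0.5, 'A2': 0.5}

present = ['foo', 'bar']               # the bag of words found in the new message
scores = {}
for author in p_word:
    score = p_author[author]           # start from the prior P(author)
    for word in present:
        score *= p_word[author][word]  # multiply by P(word | author) for each observed word
    scores[author] = score
print(scores)                          # the author with the larger score is the prediction
```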

□ With the numerical values shown at the beginning of this section, find out which author is more likely for the message.

[1] For example, the 85 Federalist Papers were written by Alexander Hamilton, James Madison, and John Jay.
[2] Several studies on the disputed Federalist Papers are based on a set of 70 so-called function words.
[3] In principle, if we had a large-enough corpus of messages from an author, we could estimate joint distributions for each combination of words, but in practice the combinatorial explosion prevents us from doing so.


How about baz, or in general any magic word that is missing from the message? Their absence may also count as information. How do the formulas (o) change to take this into account?

□ Recalculate the probabilities from the problem above, taking into account that the word baz is missing. Which author is more likely now?

Note: The BernoulliNB classifier from scikit-learn does take into account the probabilities of the missing features!
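Extending the sketch above: an absent magic word contributes a factor of 1 - P(word|author), which is the Bernoulli treatment mentioned in the note (again just a sketch of the hand calculation):

```python
p_word = {'A1': {'foo': 0.2, 'bar': 0.3, 'baz': 0.4},
          'A2': {'foo': 0.3, 'bar': 0.1, 'baz': 0.3}}
p_author = {'A1': 0.5, 'A2': 0.5}
present = {'foo', 'bar'}               # baz is absent from the message

scores = {}
for author in p_word:
    score = p_author[author]
    for word, p in p_word[author].items():
        score *= p if word in present else (1 - p)   # absent words count as 1 - P
    scores[author] = score
print(scores)
```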


Solutions:
□ Prove Bayes’ Theorem, using Eq. (**)
Multiply both sides by P(B), to get P(A|B) P(B) = P(B|A) P(A). According to (**), both sides are P(A∩B).

□ Two cab companies serve a city: the Green company operates 85% of the cabs and the Blue company operates 15% of the cabs. One of the cabs is involved in a hit-and-run accident at night, and a witness identifies the hit-and-run cab as a Blue cab. When the court tests the reliability of the witness under circumstances similar to those on the night of the accident, he correctly identifies the color of a cab 80% of the time and misidentifies it the other 20% of the time. What is the probability that the cab involved in the accident was Blue, as stated by the witness?

Hint: Use Bayes’ Theorem. Define the events A and B.
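One way to carry out the numeric check (the event names are our choice, not from the text: A = "the cab is Blue", B = "the witness says the cab is Blue"):

```python
p_blue = 0.15                   # P(A): Blue company's share of the cabs
p_green = 0.85                  # P(not A)
p_says_blue_given_blue = 0.80   # P(B|A): witness identifies the color correctly
p_says_blue_given_green = 0.20  # P(B|not A): witness misidentifies a Green cab as Blue

p_says_blue = (p_says_blue_given_blue * p_blue
               + p_says_blue_given_green * p_green)    # Law of Total Probability
print(p_says_blue_given_blue * p_blue / p_says_blue)   # P(A|B) = 0.12/0.29 ≈ 0.41
```

So even with a fairly reliable witness, the cab is more likely to have been Green, because Green cabs are so much more common: the prior dominates.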

□ With the numerical values shown at the beginning of this section, find out which author is more likely for the message.
P(A1|foo ∩ bar) ~ P(foo|A1)P(bar|A1)P(A1) = 0.2*0.3*0.5 = 0.03
P(A2|foo ∩ bar) ~ P(foo|A2)P(bar|A2)P(A2) = 0.3*0.1*0.5 = 0.015
A1 is more likely.

□ Recalculate the probabilities from the problem above, taking into account that the word baz is missing. Which author is more likely now?
The 1st probability above is further multiplied by 1 - P(baz|A1): 0.03*0.6 = 0.018
The 2nd probability above is further multiplied by 1 - P(baz|A2): 0.015*0.7 = 0.0105
A1 is still more likely.


In our textbook, we are shown how to represent four messages (each message is a row) and words (each feature/column is a word) in vectorized form:

Let us use W0, W1, W2, and W3 for clarity.

The targets/classes are represented in the array y:

Let us use A0 and A1 for clarity.

We write code to count the number of occurrences of each word in each class, by summing each column (axis=0). The function np.unique returns the sorted unique elements of a numpy array:
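A sketch of what such counting code can look like; the arrays X and y below stand in for the textbook's vectorized example (treat the exact values as an assumption, and substitute the textbook's arrays):

```python
import numpy as np

# 4 messages (rows) x 4 words W0..W3 (columns); 1 = the word appears in the message
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])            # author of each message: A0 or A1

counts = {}
for label in np.unique(y):            # np.unique(y) -> array([0, 1])
    # keep only the rows of this class, then sum each column (axis=0)
    counts[label] = X[y == label].sum(axis=0)
print(counts)
```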

Now we can convert the counts above to the probabilities we need for the Naive Bayes algorithm!

□ Finish calculating the missing probabilities above!

□ For easy reference, place all the probabilities obtained above in this table:


Since we are going to use these probabilities for multiplication, we can avoid the zeroes by adding one to all denominators and numerators. This is called Laplace smoothing:

□ We have a new message: [1, 1, 0, 0]. Calculate the products for the Naive Bayes algorithm and decide which author is more likely. Do it by using only the positive occurrences.

For more practice: Use both positive and negative occurrences.

For more practice: Programming Naive Bayes classification from scratch
► Write a Python function that takes an array of four binary values (the message) as argument, and returns the prediction.

Hint: For this problem, it is sufficient to hard-code the probability table as a two-dimensional list-of-lists ornumpy array.
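A minimal sketch of such a function, assuming a hard-coded table of smoothed probabilities P(word|author) and equal priors; the numbers below are placeholders to be replaced by the values from your table:

```python
import numpy as np

# Placeholder smoothed probabilities P(word | author):
# rows = authors A0, A1; columns = words W0..W3. Replace with your table.
P_WORD = np.array([[0.20, 0.50, 0.30, 0.70],
                   [0.60, 0.20, 0.70, 0.40]])
P_AUTHOR = np.array([0.5, 0.5])        # equal priors

def predict(message):
    """message: sequence of four 0/1 values; returns the predicted author (0 or 1)."""
    m = np.asarray(message)
    # Present words contribute P(word|author), absent words 1 - P(word|author);
    # multiply along each row and weight by the prior.
    scores = np.where(m == 1, P_WORD, 1 - P_WORD).prod(axis=1) * P_AUTHOR
    return int(np.argmax(scores))

print(predict([1, 1, 0, 0]))
```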


□ Create an array X for the four-word example, with 20 rows and 4 columns. Place 10 messages from A0 first, followed by 10 from A1. The vector y has 10 zeros (A0), followed by 10 ones (A1, or not A0).

Solution: Below is a CSV (comma-separated values) file, visualized with a spreadsheet editor (left) and with a plain-text editor (right). The name of the file is messages.csv.

The first 4 columns have the data in the array X, and the 5th has the data for y (targets). Because values are missing, we import data into a numpy array using genfromtxt:

Now we create and train a Naive Bayes classifier that implements the algorithm described above. It is called Bernoulli Naive Bayes:
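A sketch of the import-and-train step under the assumptions above (the file name messages.csv comes from the text; reading missing values as 0 is our assumption):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# genfromtxt tolerates missing values; here we assume they should be read as 0
data = np.genfromtxt('messages.csv', delimiter=',', filling_values=0)
X, y = data[:, :4], data[:, 4]         # first 4 columns = words, 5th column = author

clf = BernoulliNB(alpha=1.0)           # alpha=1 is the default (Laplace smoothing)
clf.fit(X, y)
```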

The (smoothing) parameter alpha has a meaning in NB classification that is slightly different from regression,but similar in that it controls the complexity of the model:

If all the words appear in each class of the training set (as was the case in the example above), then no smoothing is necessary. If, however, one word, e.g. W3, is missing from a class in the training set, e.g. A1, then the estimated conditional probability is zero: P(W3|A1) = 0. All future messages from A1 that happen to contain W3 will be given a probability of zero, irrespective of any other words they contain! This is effectively noise: Due to the accidental content of our sample, we are under the wrong impression that W3 never occurs in A1's messages. A model that attempts to model this accident is too complex, so alpha reduces this complexity.

To avoid the case described above, a constant alpha is added to all the counts. By default, alpha = 1 (Laplace smoothing). The NB classifiers in scikit-learn do not allow alpha = 0. Even if we give alpha a value of zero, it will be automatically set to a very small value (10⁻¹⁰) that is practically equal to zero.


The feature counts calculated “manually” in the text's code are available as an attribute of the classifier:

and the numbers of datapoints in each class are also tallied automatically:

and the prior probabilities are stored in the classifier:

Due to the multiplicative nature of calculations in the Naive Bayes algorithm, the probabilities are stored in logarithmic form - this way they can be added rather than multiplied. In our example, note that the result is -0.693..., which is simply the natural logarithm of 0.5, since both authors are equally represented.

As with all classifiers and regressors, a member function allows us to calculate the score for an array of points. Since we used the entire dataset for training, let us find the training score:
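The attributes and the score mentioned above can be inspected like this (names as in scikit-learn's BernoulliNB; clf is the classifier trained in the sketch above):

```python
print(clf.feature_count_)      # per-class counts of each word, tallied during fit
print(clf.class_count_)        # number of training messages in each class
print(clf.class_log_prior_)    # log prior probabilities; log(0.5) = -0.693...
print(clf.score(X, y))         # training score: mean accuracy on (X, y)
```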

□ What conclusion do we draw from the score above?

Underfitting (because the data set is too small!)

SKIP MultinomialNB and GaussianNB


Conclusions on Naive Bayes (BernoulliNB):

Strengths:

alpha is not as important as in regression, but it can still fine-tune the model.

Works well (efficiently) with large, sparse matrices X (more in the lab!)

Like linear models: fast to train and predict, easy to understand. On very large datasets, it is even faster to train than a linear model!

Weaknesses/limitations:

Is used only when the features are binary (0 or 1), and specifically for classifying text.

Assumes independence of features, which may not be the case in real life, e.g. the feature overcast (Y/N) is probably correlated with temperature (High/Low).

Data scarcity ... can be mitigated using smoothing (alpha).


Solutions:
□ We have a new message: [1, 1, 0, 0]. Calculate the products for the Naive Bayes algorithm and decide which author is more likely. Do it by using only the positive occurrences:

Conclusion: not A0, a.k.a. A1, is more likely.
For more practice: Use both positive and negative occurrences.

Hint: The two authors turn out to be equally likely!
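A sketch that reproduces both of these solutions numerically. The arrays X and y below are the textbook's toy example as we recall it, and the smoothing uses the (count + 1)/(class count + 2) convention that BernoulliNB's alpha = 1 corresponds to; treat both as assumptions:

```python
import numpy as np

X = np.array([[0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 0, 1], [1, 0, 1, 0]])  # assumed toy data
y = np.array([0, 1, 0, 1])
message = np.array([1, 1, 0, 0])

for label in np.unique(y):
    counts = X[y == label].sum(axis=0)
    n = (y == label).sum()
    p = (counts + 1) / (n + 2)        # Laplace-smoothed P(word | class)
    prior = n / len(y)
    positive_only = prior * p[message == 1].prod()
    pos_and_neg = prior * np.where(message == 1, p, 1 - p).prod()
    print(label, positive_only, pos_and_neg)
```

Under these assumptions, the positive-only products favor A1 (not A0), and once the negative occurrences are included the two products come out exactly equal, consistent with the hint.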