
Page 1: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Generative Models; Naive Bayes

Dan Roth [email protected]|http://www.cis.upenn.edu/~danroth/|461C, 3401 Walnut

1

Slides were created by Dan Roth (for CIS519/419 at Penn or CS446 at UIUC), and other authors who have made their ML slides available.

Page 2: Generative Models; Naive Bayes

CIS 419/519 Fall’20 2

Administration (11/23/20)
• Remember that all the lectures are available on the website before the class
  – Go over it and be prepared
  – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code.

• HW4 is out – NNs and Bayesian Learning
  – Due 12/7 (Monday; moved from 12/3)

• No lecture on Wednesday (Friday schedule) – Happy Thanksgiving!

• Projects
  – Most of you have chosen a project and a team.
  – You should have received an email from one of the TAs who will help guide your project.

• We have dates for the Final and the Project Submission; check the class website.

Are we recording? YES!

Available on the web site

Page 3: Generative Models; Naive Bayes

CIS 419/519 Fall’20 3

Projects
• CIS 519 students need to do a team project: read the project descriptions and follow the updates on the Project webpage
  – Teams will be of size 2–4
  – We will help with grouping if needed

• There will be 3 options for projects:
  – Natural Language Processing (Text)
  – Computer Vision (Images)
  – Speech (Audio)

• In all cases, we will give you datasets and initial ideas
  – The problems will be multiclass classification problems
  – You will get annotated data only for some of the labels, but will also have to predict other labels
  – Zero-shot learning; few-shot learning; transfer learning

• A detailed note will come out today.

• Timeline:
  – 11/11 Choose a project and team up
  – 11/23 Initial proposal describing what your team plans to do
  – 12/2 Progress report
  – 12/15–20 (TBD) Final paper + short video

• Try to make it interesting!

Page 4: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Recap: Error Driven (Discriminative) Learning
• Consider a distribution D over the space X × Y
• X – the instance space; Y – the set of labels (e.g., ±1)

• We can think about the data generation process as governed by D(x), and the labeling process as governed by D(y|x), such that

  D(x, y) = D(x) D(y|x)

• This can be used to model both the case where labels are generated by a function y = f(x), as well as noisy cases and probabilistic generation of the label.

• If the distribution D is known, there is no learning. We can simply predict y = argmax_y D(y|x)

• If we are looking for a hypothesis, we can simply find the one that minimizes the probability of mislabeling:

  h = argmin_h E_(x,y)~D [[h(x) ≠ y]]

Page 5: Generative Models; Naive Bayes

CIS 419/519 Fall’20 5

Recap: Error Driven Learning (2)
• Inductive learning comes into play when the distribution is not known.
• Then, there are two basic approaches to take:
  – Discriminative (Direct) Learning
  and
  – Bayesian Learning (Generative)

• Running example:
  – Text Correction:
    • "I saw the girl it the park"  →  "I saw the girl in the park"

Page 6: Generative Models; Naive Bayes

CIS 419/519 Fall’20 6

1: Direct Learning
• Model the problem of text correction as a problem of learning from examples.
• Goal: learn directly how to make predictions.

PARADIGM
• Look at many (positive/negative) examples.
• Discover some regularities in the data.
• Use these to construct a prediction policy.

• A policy (a function, a predictor) needs to be specific.
  [it/in] rule: if "the" occurs after the target ⇒ in

• Assumptions come in the form of a hypothesis class.

Bottom line: approximating h : X → Y is estimating P(Y|X).

Page 7: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Consider a distribution D over the space X × Y
• X – the instance space; Y – the set of labels (e.g., ±1)
• Given a sample {(x_i, y_i)}_{i=1..m} and a loss function L(x, y)

• Find h ∈ H that minimizes  ∑_{i=1..m} D(x_i, y_i) L(h(x_i), y_i) + Reg

L can be:
  L(h(x), y) = 1 if h(x) ≠ y, otherwise L(h(x), y) = 0   (0-1 loss)
  L(h(x), y) = (h(x) − y)²                                (L2 loss)
  L(h(x), y) = max{0, 1 − y h(x)}                         (hinge loss)
  L(h(x), y) = exp{−y h(x)}                               (exponential loss)

• Guarantees: If we find an algorithm that minimizes loss on the observed data, then learning theory guarantees good future behavior (as a function of |H|).

7

Direct Learning (2)
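For concreteness, here is a minimal sketch of the four losses listed above, assuming y ∈ {−1, +1} and that h(x) is a real-valued score (so the 0-1 loss compares signs); this is only an illustration, not part of the slide.

```python
# Sketch of the losses above; `hx` is the real-valued score h(x), `y` is +1 or -1.
import math

def zero_one_loss(hx, y):
    # 1 if the predicted sign disagrees with the label, else 0
    return 0.0 if (hx >= 0) == (y > 0) else 1.0

def l2_loss(hx, y):
    return (hx - y) ** 2

def hinge_loss(hx, y):
    return max(0.0, 1.0 - y * hx)

def exp_loss(hx, y):
    return math.exp(-y * hx)

print(hinge_loss(0.3, +1))   # 0.7: correct side of the boundary, but inside the margin
print(exp_loss(-2.0, +1))    # ~7.39: a confident mistake is penalized heavily
```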

Page 8: Generative Models; Naive Bayes

CIS 419/519 Fall’20

The model is called "generative" since it makes an assumption on how the data X is generated given y

8

2: Generative Model
• Model the problem of text correction as that of generating correct sentences.
• Goal: learn a model of the language; use it to predict.

PARADIGM
• Learn a probability distribution over all sentences
  – In practice: make assumptions on the distribution's type
• Use it to estimate which sentence is more likely:
  – Pr(I saw the girl it the park)  vs.  Pr(I saw the girl in the park)
  – In practice: a decision policy depends on the assumptions

• Guarantees: provided we assume the "right" family of probability distributions

Bottom line: the generative paradigm approximates P(X, Y) = P(X|Y) P(Y)

Page 9: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Note that there are two different notions.
• Learning probabilistic concepts
  – The learned concept is a function h: X → [0,1]
  – h(x) may be interpreted as the probability that the label 1 is assigned to x
  – The learning theory that we have studied before is applicable (with some extensions).
• Bayesian Learning: use of a probabilistic criterion in selecting a hypothesis
  – The hypothesis can be deterministic, a Boolean function.
• It's not the hypothesis – it's the process.
• In practice, as we'll see, we will use the same principles as before, often similar justifications, and similar algorithms.

9

Probabilistic Learning

Remind yourself of basic probability

Page 10: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Probabilities
• 30 years of AI research danced around the fact that the world was inherently uncertain
• Bayesian Inference:
  – Use probability theory and information about independence
  – Reason diagnostically (from evidence (effects) to conclusions (causes))...
  – ...or causally (from causes to effects)

• Probabilistic reasoning only gives probabilistic results
  – i.e., it summarizes uncertainty from various sources

• We will only use it as a tool in Machine Learning.

Page 11: Generative Models; Naive Bayes

CIS 419/519 Fall’20 11

Concepts
• Probability, Probability Space and Events
• Joint Events
• Conditional Probabilities
• Independence
  – This week's recitation will provide a refresher on probability
  – Use the material we provided on-line

• Next I will give a very quick refresher

Page 12: Generative Models; Naive Bayes

CIS 419/519 Fall’20 12

(1) Discrete Random Variables
• Let X denote a random variable
  – X is a mapping from a space of outcomes to R.
• [X = a] represents an event (a possible outcome), and
• Each event has an associated probability

• Examples of binary random variables:
  – A = I have a headache
  – A = Sally will be the US president in 2020

• P(A = True) is "the fraction of possible worlds in which A is true"
  – We could spend hours on the philosophy of this, but we won't

Adapted from slide by Andrew Moore

Page 13: Generative Models; Naive Bayes

CIS 419/519 Fall’20 13

Visualizing A
• Universe U is the event space of all possible worlds
  – Its area is 1
  – P(U) = 1

• P(A) = area of red oval
• Therefore:
  P(A) + P(¬A) = 1
  P(¬A) = 1 − P(A)

[Venn diagram: U, split into worlds in which A is true (the oval) and worlds in which A is false]

Page 14: Generative Models; Naive Bayes

CIS 419/519 Fall’20 14

Axioms of Probability
• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – de Finetti, Cox, and Carnap have also provided compelling arguments for these axioms

1. All probabilities are between 0 and 1:  0 ≤ P(A) ≤ 1

2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:
   P(true) = 1 ; P(false) = 0

3. The probability of a disjunction is given by:
   P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Page 15: Generative Models; Naive Bayes

CIS 419/519 Fall’20 15

Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(true) = 1
• P(false) = 0
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

• From these you can prove other properties:
• P(¬A) = 1 − P(A)
• P(A) = P(A ∧ B) + P(A ∧ ¬B)

[Venn diagram: A, B, and their overlap A ∧ B]

Page 16: Generative Models; Naive Bayes

CIS 419/519 Fall’20 16

Multi-valued Random Variables

• A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k}

• Think about tossing a die
• Thus…
  – P(A = v_i ∧ A = v_j) = 0  if i ≠ j
  – P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_k) = 1

Based on slide by Andrew Moore

  1 = ∑_{i=1..k} P(A = v_i)

Page 17: Generative Models; Naive Bayes

CIS 419/519 Fall’20 17

Multi-valued Random Variables

• We can also show that:
  P(B) = P(B ∧ (A = v_1 ∨ A = v_2 ∨ ⋯ ∨ A = v_k))

• This is called marginalization over A

  P(B) = ∑_{i=1..k} P(B ∧ A = v_i)

Page 18: Generative Models; Naive Bayes

CIS 419/519 Fall’20

(2) Joint Probabilities
• Joint probability: matrix of combined probabilities of a set of variables

Russell & Norvig's Alarm Domain: (Boolean RVs)
• A world has a specific instantiation of variables:
  (alarm ∧ burglary ∧ ¬earthquake)
• The joint probability is given by:

  P(alarm, burglary) =   alarm   ¬alarm
        burglary          0.09    0.01
        ¬burglary         0.1     0.8

Probability of burglary:
  P(burglary) = 0.1
by marginalization over alarm

Page 19: Generative Models; Naive Bayes

CIS 419/519 Fall’20 19

The Joint Distribution
• Recipe for making a joint distribution of d variables:

Slide © Andrew Moore

e.g., Boolean variables A, B, C

Page 20: Generative Models; Naive Bayes

CIS 419/519 Fall’20 20

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).

  A B C
  0 0 0
  0 0 1
  0 1 0
  0 1 1
  1 0 0
  1 0 1
  1 1 0
  1 1 1

Slide © Andrew Moore

e.g., Boolean variables A, B, C

Page 21: Generative Models; Naive Bayes

CIS 419/519 Fall’20 21

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).
  2. For each combination of values, say how probable it is.

  A B C  Prob
  0 0 0  0.30
  0 0 1  0.05
  0 1 0  0.10
  0 1 1  0.05
  1 0 0  0.05
  1 0 1  0.10
  1 1 0  0.25
  1 1 1  0.10

Slide © Andrew Moore

e.g., Boolean variables A, B, C

Page 22: Generative Models; Naive Bayes

CIS 419/519 Fall’20 22

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.

  A B C  Prob
  0 0 0  0.30
  0 0 1  0.05
  0 1 0  0.10
  0 1 1  0.05
  1 0 0  0.05
  1 0 1  0.10
  1 1 0  0.25
  1 1 1  0.10

[Venn diagram of A, B, C with the region probabilities 0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]

e.g., Boolean variables A, B, C

Slide © Andrew Moore

Page 23: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Inferring Probabilities from the Joint

• P(alarm) = ∑_{b,e} P(alarm ∧ burglary = b ∧ earthquake = e)
           = 0.01 + 0.08 + 0.01 + 0.09 = 0.19

• P(burglary) = ∑_{a,e} P(alarm = a ∧ burglary ∧ earthquake = e)
              = 0.01 + 0.08 + 0.001 + 0.009 = 0.1

                       alarm                       ¬alarm
              earthquake  ¬earthquake     earthquake  ¬earthquake
  burglary       0.01        0.08            0.001       0.009
  ¬burglary      0.01        0.09            0.01        0.79
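As a quick illustration, a sketch of the same marginalization done in code, with the joint table above stored as a dictionary keyed by (alarm, burglary, earthquake):

```python
# Sketch: the joint distribution from the slide, and marginalization by summing entries.
joint = {
    (True,  True,  True):  0.01,  (True,  True,  False): 0.08,
    (True,  False, True):  0.01,  (True,  False, False): 0.09,
    (False, True,  True):  0.001, (False, True,  False): 0.009,
    (False, False, True):  0.01,  (False, False, False): 0.79,
}

def marginal(pred):
    """Sum the joint over all worlds (alarm, burglary, earthquake) that satisfy `pred`."""
    return sum(p for world, p in joint.items() if pred(world))

p_alarm    = marginal(lambda w: w[0])   # 0.19
p_burglary = marginal(lambda w: w[1])   # 0.10
print(p_alarm, p_burglary)
```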

Page 24: Generative Models; Naive Bayes

CIS 419/519 Fall’20 24

(3) Conditional Probability
• P(A | B) = fraction of worlds in which B is true that also have A true

[Venn diagram: U, with ovals A and B]

What if we already know that B is true?
That knowledge changes the probability of A
• Because we know we're in a world where B is true

  P(A | B) = P(A ∧ B) / P(B)

  P(A ∧ B) = P(A | B) × P(B)

Page 25: Generative Models; Naive Bayes

CIS 419/519 Fall’20 25

Example: Conditional Probabilities

              alarm   ¬alarm
  burglary     0.09    0.01
  ¬burglary    0.1     0.8

  P(A | B) = P(A ∧ B) / P(B)
  P(A ∧ B) = P(A | B) × P(B)

  P(burglary | alarm)  = P(burglary ∧ alarm) / P(alarm)
                       = 0.09 / 0.19 = 0.47

  P(alarm | burglary)  = P(burglary ∧ alarm) / P(burglary)
                       = 0.09 / 0.1 = 0.9

  P(alarm, burglary)   = P(burglary | alarm) P(alarm)
                       = 0.47 × 0.19 = 0.09

Page 26: Generative Models; Naive Bayes

CIS 419/519 Fall’20 26

(4) Independence
• When two events do not affect each other's probabilities, we call them independent
• Formal definition:

  A ⫫ B  ↔  P(A ∧ B) = P(A) × P(B)
         ↔  P(A | B) = P(A)

Page 27: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Is smart independent of study?

• Is prepared independent of study?

27

Exercise: Independence

P(smart ∧ study ∧ prep):
                  smart                ¬smart
             study   ¬study       study   ¬study
  prepared   0.432    0.16        0.084    0.008
  ¬prepared  0.048    0.16        0.036    0.072

Page 28: Generative Models; Naive Bayes

CIS 419/519 Fall’20 28

Exercise: Independence

Is smart independent of study?
  P(study ∧ smart) = 0.432 + 0.048 = 0.48
  P(study) = 0.432 + 0.048 + 0.084 + 0.036 = 0.6
  P(smart) = 0.432 + 0.048 + 0.16 + 0.16 = 0.8
  P(study) × P(smart) = 0.6 × 0.8 = 0.48
So yes!

Is prepared independent of study?

P(smart ∧ study ∧ prep):
                  smart                ¬smart
             study   ¬study       study   ¬study
  prepared   0.432    0.16        0.084    0.008
  ¬prepared  0.048    0.16        0.036    0.072
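A sketch that checks both independence questions numerically from the joint table (keys are (smart, study, prepared)):

```python
# Sketch: verify the independence claims from the joint table on the slide.
joint = {
    (1, 1, 1): 0.432, (1, 0, 1): 0.16, (0, 1, 1): 0.084, (0, 0, 1): 0.008,
    (1, 1, 0): 0.048, (1, 0, 0): 0.16, (0, 1, 0): 0.036, (0, 0, 0): 0.072,
}
def P(pred):
    return sum(p for w, p in joint.items() if pred(w))

p_smart, p_study, p_prep = P(lambda w: w[0]), P(lambda w: w[1]), P(lambda w: w[2])
print(P(lambda w: w[0] and w[1]), p_smart * p_study)  # 0.48 vs 0.48   -> independent
print(P(lambda w: w[2] and w[1]), p_prep * p_study)   # 0.516 vs ~0.41 -> not independent
```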

Page 29: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Absolute independence of A and B:
  A ⫫ B  ↔  P(A ∧ B) = P(A) × P(B)
         ↔  P(A | B) = P(A)
• Conditional independence of A and B given C:
  A ⫫ B | C  ↔  P(A ∧ B | C) = P(A | C) × P(B | C)

• This lets us decompose the joint distribution:
  – Conditional independence is different than absolute independence, but still useful in decomposing the full joint
  – P(A, B, C) = [Always] P(A, B | C) P(C)
               = [Independence] P(A | C) P(B | C) P(C)

29

Conditional Independence

Page 30: Generative Models; Naive Bayes

CIS 419/519 Fall’20 30

Exercise: Independence

• Is smart independent of study?

• Is prepared independent of study?

P(smart ∧ study ∧ prep):
                  smart                ¬smart
             study   ¬study       study   ¬study
  prepared   0.432    0.16        0.084    0.008
  ¬prepared  0.048    0.16        0.036    0.072

Page 31: Generative Models; Naive Bayes

CIS 419/519 Fall’20 31

Take Home Exercise: Conditional Independence

• Is smart conditionally independent of prepared, given study?

• Is study conditionally independent of prepared, given smart?

P(smart ∧ study ∧ prep):
                  smart                ¬smart
             study   ¬study       study   ¬study
  prepared   0.432    0.16        0.084    0.008
  ¬prepared  0.048    0.16        0.036    0.072

Page 32: Generative Models; Naive Bayes

CIS 419/519 Fall’20 32

Summary: Basic Probability
• Product Rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
• If A and B are independent:
  – P(A, B) = P(A) P(B);  P(A|B) = P(A),  P(A|B, C) = P(A|C)
• Sum Rule: P(A ∨ B) = P(A) + P(B) − P(A, B)
• Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
• Total Probability:
  – If events A_1, A_2, … A_n are mutually exclusive: A_i ∩ A_j = ∅, ∑_i P(A_i) = 1
  – P(B) = ∑_i P(B, A_i) = ∑_i P(B|A_i) P(A_i)

• Total Conditional Probability:
  – If events A_1, A_2, … A_n are mutually exclusive: A_i ∩ A_j = ∅, ∑_i P(A_i) = 1
  – P(B|C) = ∑_i P(B, A_i | C) = ∑_i P(B|A_i, C) P(A_i|C)

Page 33: Generative Models; Naive Bayes

CIS 419/519 Fall’20 33

Summary: The Monty Hall problem
• Suppose you're on a game show, and you're given the choice of three doors.
  – Behind one door is a car; behind the others, goats.
  – You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat.
  – He then says to you, "Do you want to switch to door No. 2?"

• Is it to your advantage to switch your choice?
• Try to develop a rigorous argument using basic probability theory.
  – Very similar to your quiz questions.
  – Use Bayes Rule
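If you prefer to sanity-check the argument empirically, here is a small simulation sketch of the game (switching should win about 2/3 of the time); the rigorous Bayes-rule argument is still the exercise.

```python
# Sketch: simulate Monty Hall; switching wins whenever the initial pick was wrong.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a goat door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=False), play(switch=True))   # ~0.33 vs ~0.67
```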

Page 34: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Exactly the process we just used
• The most important formula in probabilistic machine learning:

  P(A|B) = P(B|A) × P(A) / P(B)

(Super easy) derivation:
  P(A ∧ B) = P(A|B) × P(B)
  P(B ∧ A) = P(B|A) × P(A)
Just set equal…  P(A|B) × P(B) = P(B|A) × P(A)  …and solve.

Note: P(A ∧ B) = P(A, B)

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418.

34

Bayes' Rule

Page 35: Generative Models; Naive Bayes

CIS 419/519 Fall’20 38

Bayes' Rule for Machine Learning

• Allows us to reason from evidence to hypotheses
• Another way of thinking about Bayes' rule:

Page 36: Generative Models; Naive Bayes

CIS 419/519 Fall’20 39

Basics of Bayesian Learning
• Goal: find the best hypothesis from some space H of hypotheses, given the observed data (evidence) D.

• Define best to be: most probable hypothesis in H
• In order to do that, we need to assume a probability distribution over the class H.

• In addition, we need to know something about the relation between the data observed and the hypotheses
  – Think about tossing a coin – the probability of getting a Head is related to the observation
  – As we will see, we will be Bayesian about other things, e.g., the parameters of the model

Page 37: Generative Models; Naive Bayes

CIS 419/519 Fall’20 40

Basics of Bayesian Learning
• P(h) – the prior probability of a hypothesis h
  Reflects background knowledge, before data is observed. If no information – uniform distribution.

• P(D) – the probability that this sample of the data is observed. (No knowledge of the hypothesis)

• P(D|h): the probability of observing the sample D, given that hypothesis h is the target

• P(h|D): the posterior probability of h. The probability that h is the target, given that D has been observed.

Page 38: Generative Models; Naive Bayes

CIS 419/519 Fall’20 41

Bayes Theorem

  P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h)

• P(h|D) decreases with P(D)

Page 39: Generative Models; Naive Bayes

CIS 419/519 Fall’20 42

Learning Scenario
  P(h|D) = P(D|h) P(h) / P(D)

• The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one, h ∈ H, given the observed data.

• Such a maximally probable hypothesis is called the maximum a posteriori hypothesis (MAP); Bayes theorem is used to compute it:

  h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D)
        = argmax_{h∈H} P(D|h) P(h)

Page 40: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β„Žπ‘€π‘€π΄π΄π‘ƒπ‘ƒ = π‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žβ„Žβˆˆπ»π» 𝑃𝑃 β„Ž 𝐷𝐷 = π‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žβ„Žβˆˆπ»π»π‘ƒπ‘ƒ(𝐷𝐷|β„Ž) 𝑃𝑃(β„Ž)

β€’ We may assume that a priori, hypotheses are equally probable: 𝑃𝑃 β„Žπ‘Žπ‘Ž = 𝑃𝑃 β„Žπ‘—π‘— βˆ€ β„Žπ‘Žπ‘Ž,β„Žπ‘—π‘— ∈ 𝐻𝐻

β€’ We get the Maximum Likelihood hypothesis:

β„Žπ‘€π‘€πΏπΏ = π‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žβ„Žβˆˆπ»π» 𝑃𝑃(𝐷𝐷|β„Ž)

β€’ Here we just look for the hypothesis that best explains the data

43

Learning Scenario (2)

Page 41: Generative Models; Naive Bayes

CIS 419/519 Fall’20 44

Examples
• h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what is the bias of the coin [This is a learning problem!]
• Two hypotheses: h1: P(H) = 0.5;  h2: P(H) = 0.6

  Prior: P(h): P(h1) = 0.75,  P(h2) = 0.25
  Now we need data. 1st experiment: the coin toss came out H.
  – P(D|h):
    P(D|h1) = 0.5 ; P(D|h2) = 0.6
  – P(D):
    P(D) = P(D|h1) P(h1) + P(D|h2) P(h2) = 0.5 × 0.75 + 0.6 × 0.25 = 0.525
  – P(h|D):
    P(h1|D) = P(D|h1) P(h1) / P(D) = 0.5 × 0.75 / 0.525 = 0.714
    P(h2|D) = P(D|h2) P(h2) / P(D) = 0.6 × 0.25 / 0.525 = 0.286

Page 42: Generative Models; Naive Bayes

CIS 419/519 Fall’20 45

Examples (2)
• h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what is the bias of the coin [This is a learning problem!]

• Two hypotheses: h1: P(H) = 0.5;  h2: P(H) = 0.6
  – Prior: P(h): P(h1) = 0.75,  P(h2) = 0.25

• After the 1st coin toss is H we still think that the coin is more likely to be fair

• If we were to use the Maximum Likelihood approach (i.e., assume equal priors) we would think otherwise. The data supports the biased coin better.

• Try: 100 coin tosses; 70 heads.
• You will believe that the coin is biased.

Page 43: Generative Models; Naive Bayes

CIS 419/519 Fall’20 46

Examples (2)
  h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what is the bias of the coin [This is a learning problem!]

• Two hypotheses: h1: P(H) = 0.5;  h2: P(H) = 0.6
  – Prior: P(h): P(h1) = 0.75,  P(h2) = 0.25

• Case of 100 coin tosses; 70 heads.

  P(D) = P(D|h1) P(h1) + P(D|h2) P(h2)
       = 0.5^100 × 0.75 + 0.6^70 × 0.4^30 × 0.25
       = 7.9 × 10^−31 × 0.75 + 3.4 × 10^−28 × 0.25

  0.0057 = P(h1|D) = P(D|h1) P(h1) / P(D)  <<  P(D|h2) P(h2) / P(D) = P(h2|D) = 0.9943

In this example we had to choose one hypothesis out of two possibilities; realistically, we could have infinitely many to choose from. We will still follow the maximum likelihood principle.
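A sketch of the posterior computation for this coin example, done in log space so the 100-toss case does not underflow; the shared binomial coefficient cancels between the two hypotheses.

```python
# Sketch: posterior over h1 (fair) and h2 (P(H)=0.6) after k heads in n tosses,
# with priors P(h1)=0.75, P(h2)=0.25 as on the slide.
from math import log, exp

def posterior(k, n, priors=(0.75, 0.25), heads_prob=(0.5, 0.6)):
    logs = [log(pr) + k * log(ph) + (n - k) * log(1 - ph)
            for pr, ph in zip(priors, heads_prob)]
    m = max(logs)                       # stabilize before exponentiating
    ws = [exp(l - m) for l in logs]
    z = sum(ws)
    return [w / z for w in ws]

print(posterior(1, 1))      # ~[0.714, 0.286]  -- one toss, came out H
print(posterior(70, 100))   # ~[0.0057, 0.9943] -- 70 heads in 100 tosses
```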

Page 44: Generative Models; Naive Bayes

CIS 419/519 Fall’20 48

Page 45: Generative Models; Naive Bayes

CIS 419/519 Fall’20 49

Page 46: Generative Models; Naive Bayes

CIS 419/519 Fall’20 50

Maximum Likelihood Estimate
• Assume that you toss a (p, 1−p) coin m times and get k Heads, m − k Tails. What is p?

• If p is the probability of Head, the probability of the data observed is:

  P(D|p) = p^k (1 − p)^{m−k}

• The log likelihood:

  L(p) = log P(D|p) = k log(p) + (m − k) log(1 − p)

• To maximize, set the derivative w.r.t. p equal to 0:

  dL(p)/dp = k/p − (m − k)/(1 − p)

• Solving this for p gives: p = k/m

1. The model we assumed is binomial. You could assume a different model! Next we will consider other models and see how to learn their parameters.

2. In practice, smoothing is advisable – deriving the right smoothing can be done by assuming a prior.
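A quick numerical sketch confirming that the log-likelihood above peaks at p = k/m, here via a simple grid search:

```python
# Sketch: maximize k*log(p) + (m-k)*log(1-p) over a grid and compare to k/m.
import math

def log_lik(p, k, m):
    return k * math.log(p) + (m - k) * math.log(1 - p)

k, m = 70, 100
grid = [i / 1000 for i in range(1, 1000)]           # avoid p = 0 and p = 1
p_hat = max(grid, key=lambda p: log_lik(p, k, m))
print(p_hat, k / m)                                  # both 0.7
```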

Page 47: Generative Models; Naive Bayes

CIS 419/519 Fall’20 51

Probability Distributions
• Bernoulli Distribution:
  – Random variable X takes values {0, 1} s.t. P(X = 1) = p = 1 − P(X = 0)
  – (Think of tossing a coin)

• Binomial Distribution:
  – Random variable X takes values {0, 1, …, n}, representing the number of successes (X_i = 1) in n Bernoulli trials.
  – P(X = k) = f(n, p, k) = C(n, k) p^k (1 − p)^{n−k}
  – Note that if X ~ Binom(n, p) and Y_i ~ Bernoulli(p), then X = ∑_{i=1..n} Y_i
  – (Think of multiple coin tosses)

Page 48: Generative Models; Naive Bayes

CIS 419/519 Fall’20 52

Probability Distributions (2)
• Categorical Distribution:
  – Random variable X takes on values in {1, 2, …, k} s.t. P(X = i) = p_i and ∑_{i=1..k} p_i = 1
  – (Think of a die)

• Multinomial Distribution:
  – Let the random variable X_i (i = 1, 2, …, k) indicate the number of times outcome i was observed over the n trials.
  – The vector X = (X_1, …, X_k) follows a multinomial distribution (n, p) where p = (p_1, …, p_k) and ∑_{i=1..k} p_i = 1
  – f(x_1, x_2, … x_k; n, p) = P(X_1 = x_1, …, X_k = x_k)
      = n! / (x_1! … x_k!) · p_1^{x_1} … p_k^{x_k}   when ∑_{i=1..k} x_i = n
  – (Think of n tosses of a k-sided die)

Page 49: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Our eventual goal will be: Given a document, predict whether it's "good" or "bad"

53

An Application: A Multinomial Bag of Words

• We are given a collection of documents written in a three-word language {a, b, c}. All the documents have exactly n words (each word can be either a, b or c).
• We are given a labeled document collection of size m, {D_1, D_2, …, D_m}. The label y_i of document D_i is 1 or 0, indicating whether D_i is "good" or "bad".

• Our generative model uses the multinomial distribution. It first decides whether to generate a good or a bad document (with P(y_i = 1) = η). Then, it places words in the document; let a_i (b_i, c_i, resp.) be the number of times word a (b, c, resp.) appears in document D_i. That is, we have a_i + b_i + c_i = |D_i| = n.
• In this generative model, we have:

  P(D_i | y = 1) = n! / (a_i! b_i! c_i!) · α_1^{a_i} β_1^{b_i} γ_1^{c_i}

  where α_1 (β_1, γ_1 resp.) is the probability that a (b, c) appears in a "good" document.
• Similarly, P(D_i | y = 0) = n! / (a_i! b_i! c_i!) · α_0^{a_i} β_0^{b_i} γ_0^{c_i}

• Note that: α_0 + β_0 + γ_0 = α_1 + β_1 + γ_1 = 1

Unlike the discriminative case, the "game" here is different: We make an assumption on how the data is being generated (multinomial, with η, α_i, β_i, γ_i). We observe documents, and estimate these parameters (that's the learning problem). Once we have the parameters, we can predict the corresponding label.

How do we model it? What is the learning problem?

Page 50: Generative Models; Naive Bayes

CIS 419/519 Fall’20

A Multinomial Bag of Words (2)
• We are given a collection of documents written in a three-word language {a, b, c}. All the documents have exactly n words (each word can be either a, b or c).
• We are given a labeled document collection {D_1, D_2, …, D_m}. The label y_i of document D_i is 1 or 0, indicating whether D_i is "good" or "bad".

• The classification problem: given a document D, determine if it is good or bad; that is, determine P(y|D).

• This can be determined via Bayes rule: P(y|D) = P(D|y) P(y) / P(D)

• But, we need to know the parameters of the model to compute that.

Page 51: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• How do we estimate the parameters?
• We derive the most likely value of the parameters defined above, by maximizing the log likelihood of the observed data.
• PD = ∏_i P(y_i, D_i) = ∏_i P(D_i | y_i) P(y_i)
  – We denote by P(y_i = 1) = η the probability that an example is "good" (y_i = 1; otherwise y_i = 0). Then:

  ∏_i P(y_i, D_i) = ∏_i [ η · n!/(a_i! b_i! c_i!) · α_1^{a_i} β_1^{b_i} γ_1^{c_i} ]^{y_i} · [ (1 − η) · n!/(a_i! b_i! c_i!) · α_0^{a_i} β_0^{b_i} γ_0^{c_i} ]^{1−y_i}

• We want to maximize it with respect to each of the parameters. We first compute log(PD) and then differentiate:

  log(PD) = ∑_i y_i [ log(η) + C + a_i log(α_1) + b_i log(β_1) + c_i log(γ_1) ]
          + (1 − y_i) [ log(1 − η) + C' + a_i log(α_0) + b_i log(β_0) + c_i log(γ_0) ]

  d log(PD) / dη = ∑_i [ y_i/η − (1 − y_i)/(1 − η) ] = 0  →  ∑_i (y_i − η) = 0  →  η = ∑_i y_i / m

• The same can be done for the other 6 parameters. However, notice that they are not independent: α_0 + β_0 + γ_0 = α_1 + β_1 + γ_1 = 1 and also a_i + b_i + c_i = |D_i| = n.

55

A Multinomial Bag of Words (3)

Labeled data, assuming that the examples are independent.

Notice that this is an important trick to write down the joint probability without knowing what the outcome of the experiment is. The i-th expression evaluates to p(D_i, y_i). (It could be written as a sum with a multiplicative y_i, but that is less convenient.) Makes sense?
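A sketch of the resulting estimator on a tiny hypothetical corpus: η follows the derivation above, and the word probabilities use the standard count-based multinomial MLE (relative counts within each class), which is what the constrained maximization of the remaining six parameters yields.

```python
# Sketch: count-based MLE for (eta, alpha, beta, gamma) in the bag-of-words model.
from collections import Counter

# Hypothetical toy corpus over {a, b, c}; label 1 = "good", 0 = "bad".
docs   = ["aab", "abc", "bcc", "ccc"]
labels = [1, 1, 0, 0]

def estimate(docs, labels):
    eta = sum(labels) / len(labels)          # eta = (sum_i y_i) / m, as derived above
    counts = {0: Counter(), 1: Counter()}
    for d, y in zip(docs, labels):
        counts[y].update(d)                  # per-class word counts
    params = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        params[y] = {w: counts[y][w] / total for w in "abc"}
    return eta, params

print(estimate(docs, labels))
```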

Page 52: Generative Models; Naive Bayes

CIS 419/519 Fall’20 56

Administration (11/30/20)
• Remember that all the lectures are available on the website before the class
  – Go over it and be prepared
  – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code.

• HW4 is out – NNs and Bayesian Learning
  – Due 12/7 (Next Monday; moved from 12/3)

• Next week is the last week of classes – there will be three meetings (Monday, Wednesday, Thursday)

• Projects
  – Most of you have chosen a project and a team.
  – You should have received an email from one of the TAs who will help guide your project

• The Final is on 12/18; it is scheduled for 9am; we will try to give you a larger window of time.
• Project Paper and Video Submission: 12/21

Are we recording? YES!

Available on the web site

Page 53: Generative Models; Naive Bayes

CIS 419/519 Fall’20 57

Projects
• CIS 519 students need to do a team project: read the project descriptions and follow the updates on the Project webpage
  – Teams will be of size 2–4
  – We will help with grouping if needed

• There will be 3 options for projects:
  – Natural Language Processing (Text)
  – Computer Vision (Images)
  – Speech (Audio)

• In all cases, we will give you datasets and initial ideas
  – The problems will be multiclass classification problems
  – You will get annotated data only for some of the labels, but will also have to predict other labels
  – Zero-shot learning; few-shot learning; transfer learning

• A detailed note will come out today.

• Timeline:
  – 11/11 Choose a project and team up
  – 11/23 Initial proposal describing what your team plans to do
  – 12/2 Progress report: 1 page. What you have done; plans; problems.
  – 12/21 Final paper + short video

• Try to make it interesting!

Page 54: Generative Models; Naive Bayes

CIS 419/519 Fall’20 58

Bayesian Learning

Page 55: Generative Models; Naive Bayes

CIS 419/519 Fall’20 59

Generating Sequences: HMMs
• Consider the following process of generating a sentence:
• First choose a part-of-speech tag (noun, verb, adjective, determiner, etc.)
  – Given that, choose a word.
  – Next choose the next part of speech, and given it, a word.
  – Keep on going, until you choose a punctuation mark as a POS tag and a period.

[Figure: generating "The book is interesting ." with POS tags DET N V Adj ., annotated with the transition and emission probabilities used along the way]

• What are the probabilities you need to have in order to generate this sentence?

• Transition: P(pos_t | pos_{t−1})   (#(pos) × #(pos))
• Emission: P(w | pos)               (#(w) × #(pos))
• Initial: p_0(pos)                  (#(pos))
• Given data, you can estimate these probabilities
• Once you have them, when observing a sentence you can predict its POS-tag sequence.

Page 56: Generative Models; Naive Bayes

CIS 419/519 Fall’20 60

Generating Sequences: HMMs
• Consider data over 5 characters, x = a, b, c, d, e, and 3 states s = B, I, O
  – Think about chunking a sentence into phrases; B is the Beginning of each phrase, I is Inside a phrase, O is Outside.
  – (Or: think about being in one of two rooms, observing x in it, moving to the other)

• We generate characters according to:
• Initial state prob: p(B) = 1; p(I) = p(O) = 0
• State transition prob:
  – p(B→B) = 0.8   p(B→I) = 0.2
  – p(I→B) = 0.5   p(I→I) = 0.5
• Output prob:
  – p(a|B) = 0.25, p(b|B) = 0.10, p(c|B) = 0.10, ….
  – p(a|I) = 0.25, p(b|I) = 0, …
• Can follow the generation process to get the observed sequence.

[Figure: the two-state transition diagram over B and I with the probabilities above, and a sample generated sequence of states and characters]

We can do the same exercise we did before.

Data: {(x_1, x_2, … x_m, s_1, s_2, … s_m)}_1^n

Find the most likely parameters of the model:
  P(x_i | s_i), P(s_{i+1} | s_i), p(s_1)

Given an unlabeled example (observation) x = (x_1, x_2, … x_m), use Bayes rule to predict the label l = (s_1, s_2, … s_m):

  l* = argmax_l P(l|x) = argmax_l P(x|l) P(l) / P(x)

The only issue is computational: there are 2^m possible values of l (labels).

This is an HMM model (but nothing was hidden; the s_i were given in training). It is also possible to solve when the s_1, s_2, … s_m are hidden, via EM.
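A sketch that follows this generation process in code; the output distributions are only partially specified on the slide, so the remaining probability mass below is an assumption made purely for illustration.

```python
# Sketch: sample a (state, character) sequence from the HMM parameters above.
import random

init  = {"B": 1.0, "I": 0.0}
trans = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
emit  = {"B": {"a": 0.25, "b": 0.10, "c": 0.10, "d": 0.30, "e": 0.25},   # d, e assumed
         "I": {"a": 0.25, "b": 0.00, "c": 0.25, "d": 0.25, "e": 0.25}}   # c, d, e assumed

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(length=10):
    s, out = sample(init), []
    for _ in range(length):
        out.append((s, sample(emit[s])))   # emit a character from the current state
        s = sample(trans[s])               # then move to the next state
    return out

print(generate())
```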

Page 57: Generative Models; Naive Bayes

CIS 419/519 Fall’20 61

Bayes Optimal Classifier
• How should we use the general formalism?
• What should H be?
• H can be a collection of functions. Given the training data, choose an optimal function. Then, given new data, evaluate the selected function on it.

• H can be a collection of possible predictions. Given the data, try to directly choose the optimal prediction.

• Could be different!

Page 58: Generative Models; Naive Bayes

CIS 419/519 Fall’20 62

Bayes Optimal Classifier
• The first formalism suggests to learn a good hypothesis and use it.
• (Language modeling, grammar learning, etc. are here)

  h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• The second one suggests to directly choose a decision. [it/in]:
• This is the issue of "thresholding" vs. entertaining all options until the last minute. (Computational issues)

Page 59: Generative Models; Naive Bayes

CIS 419/519 Fall’20 63

Bayes Optimal Classifier: Example
• Assume a space of 3 hypotheses:
  – P(h1|D) = 0.4;  P(h2|D) = 0.3;  P(h3|D) = 0.3  →  h_MAP = h1
• Given a new instance x, assume that
  – h1(x) = 1   h2(x) = 0   h3(x) = 0
• In this case,
  – P(f(x) = 1) = 0.4;  P(f(x) = 0) = 0.6, but h_MAP(x) = 1

• We want to determine the most probable classification by combining the predictions of all hypotheses, weighted by their posterior probabilities

• Think about it as deferring your commitment – don't commit to the most likely hypothesis until you have to make a decision (this has computational consequences)

Page 60: Generative Models; Naive Bayes

CIS 419/519 Fall’20 64

Bayes Optimal Classifier: Example (2)
• Let V be a set of possible classifications

  P(v_j | D) = ∑_{h_i∈H} P(v_j | h_i, D) P(h_i | D) = ∑_{h_i∈H} P(v_j | h_i) P(h_i | D)

• Bayes Optimal Classification:

  v = argmax_{v_j∈V} P(v_j | D) = argmax_{v_j∈V} ∑_{h_i∈H} P(v_j | h_i) P(h_i | D)

• In the example:

  P(1 | D) = ∑_{h_i∈H} P(1 | h_i) P(h_i | D) = 1·0.4 + 0·0.3 + 0·0.3 = 0.4

  P(0 | D) = ∑_{h_i∈H} P(0 | h_i) P(h_i | D) = 0·0.4 + 1·0.3 + 1·0.3 = 0.6

• and the optimal prediction is indeed 0.
• The key example of using a "Bayes optimal classifier" is that of the Naïve Bayes algorithm.
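A sketch of this weighted vote, using the posteriors and per-hypothesis predictions from the example:

```python
# Sketch: Bayes optimal classification as a posterior-weighted vote over hypotheses.
posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": 1, "h2": 0, "h3": 0}      # each h_i's prediction on the new x

def bayes_optimal(posteriors, predictions, values=(0, 1)):
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal(posteriors, predictions))  # (0, {0: 0.6, 1: 0.4})
```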

Page 61: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Bayesian Classifier
• f: X → V, a finite set of values
• Instances x ∈ X can be described as a collection of features
  x = (x_1, x_2, … x_n),  x_i ∈ {0, 1}
• Given an example, assign it the most probable value in V
• Bayes Rule:

  v_MAP = argmax_{v_j∈V} P(v_j | x) = argmax_{v_j∈V} P(v_j | x_1, x_2, … x_n)

  v_MAP = argmax_{v_j∈V} P(x_1, x_2, …, x_n | v_j) P(v_j) / P(x_1, x_2, …, x_n)
        = argmax_{v_j∈V} P(x_1, x_2, …, x_n | v_j) P(v_j)

• Notational convention: P(y) means P(Y = y)

65

Page 62: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Bayesian Classifier
  v_MAP = argmax_v P(x_1, x_2, …, x_n | v) P(v)

• Given training data we can estimate the two terms.
• Estimating P(v) is easy. E.g., under the binomial distribution assumption, count the number of times v appears in the training data.
• However, it is not feasible to estimate P(x_1, x_2, …, x_n | v)
  – In this case we have to estimate, for each target value, the probability of each instance
  – (most of which will not occur).

• In order to use Bayesian classifiers in practice, we need to make assumptions that will allow us to estimate these quantities.

66

Page 63: Generative Models; Naive Bayes

CIS 419/519 Fall’20 67

Naive Bayes
  v_MAP = argmax_v P(x_1, x_2, …, x_n | v) P(v)

  P(x_1, x_2, …, x_n | v_j) = P(x_1 | x_2, …, x_n, v_j) P(x_2, …, x_n | v_j)
                            = P(x_1 | x_2, …, x_n, v_j) P(x_2 | x_3, …, x_n, v_j) P(x_3, …, x_n | v_j)
                            = ⋯
                            = P(x_1 | x_2, …, x_n, v_j) P(x_2 | x_3, …, x_n, v_j) ⋯ P(x_n | v_j)
                            = ∏_{i=1..n} P(x_i | v_j)

• Assumption: feature values are independent given the target value

Page 64: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Naive Bayes (2)
  v_MAP = argmax_v P(x_1, x_2, …, x_n | v) P(v)

• Assumption: feature values are independent given the target value

  P(x_1 = b_1, x_2 = b_2, …, x_n = b_n | v = v_j) = ∏_{i=1..n} P(x_i = b_i | v = v_j)

• Generative model:
• First choose a label value v_j ∈ V according to P(v)
• For each v_j: choose x_1, x_2, …, x_n according to P(x_k | v_j)

68

Page 65: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Naive Bayes (3)
  v_MAP = argmax_v P(x_1, x_2, …, x_n | v) P(v)

• Assumption: feature values are independent given the target value

  P(x_1 = b_1, x_2 = b_2, …, x_n = b_n | v = v_j) = ∏_{i=1..n} P(x_i = b_i | v = v_j)

• Learning method: Estimate n|V| + |V| parameters and use them to make a prediction. (How to estimate?)

• Notice that this is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).

• This is learning without trying to achieve consistency or even approximate consistency.
• Why does it work?

69

Page 66: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• Notice that the feature values are conditionally independent given the target value, and are not required to be independent.

• Example: The Boolean features are x and y. We define the label to be l = f(x, y) = x ∧ y over the product distribution:
  p(x = 0) = p(x = 1) = 1/2 and p(y = 0) = p(y = 1) = 1/2
  The distribution is defined so that x and y are independent: p(x, y) = p(x) p(y)

  That is:
             x = 0         x = 1
  y = 0   1/4 (l = 0)   1/4 (l = 0)
  y = 1   1/4 (l = 0)   1/4 (l = 1)

• But, given that l = 0:
  p(x = 1 | l = 0) = p(y = 1 | l = 0) = 1/3
  while: p(x = 1, y = 1 | l = 0) = 0
  so x and y are not conditionally independent.

70

Conditional Independence (1)

Page 67: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• The other direction also does not hold. x and y can be conditionally independent but not independent.

• Example: We define a distribution s.t.:
  l = 0: p(x = 1 | l = 0) = 1,  p(y = 1 | l = 0) = 0
  l = 1: p(x = 1 | l = 1) = 0,  p(y = 1 | l = 1) = 1
  and assume that: p(l = 0) = p(l = 1) = 1/2

             x = 0         x = 1
  y = 0    0 (l = 0)    1/2 (l = 0)
  y = 1   1/2 (l = 1)    0 (l = 1)

• Given the value of l, x and y are independent (check)
• What about unconditional independence?
  p(x = 1) = p(x = 1 | l = 0) p(l = 0) + p(x = 1 | l = 1) p(l = 1) = 0.5 + 0 = 0.5
  p(y = 1) = p(y = 1 | l = 0) p(l = 0) + p(y = 1 | l = 1) p(l = 1) = 0 + 0.5 = 0.5
  But,
  p(x = 1, y = 1) = p(x = 1, y = 1 | l = 0) p(l = 0) + p(x = 1, y = 1 | l = 1) p(l = 1) = 0
  so x and y are not independent.

71

Conditional Independence (2)

Page 68: Generative Models; Naive Bayes

CIS 419/519 Fall’20

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  1    Sunny     Hot          High      Weak    No
  2    Sunny     Hot          High      Strong  No
  3    Overcast  Hot          High      Weak    Yes
  4    Rain      Mild         High      Weak    Yes
  5    Rain      Cool         Normal    Weak    Yes
  6    Rain      Cool         Normal    Strong  No
  7    Overcast  Cool         Normal    Strong  Yes
  8    Sunny     Mild         High      Weak    No
  9    Sunny     Cool         Normal    Weak    Yes
  10   Rain      Mild         Normal    Weak    Yes
  11   Sunny     Mild         Normal    Strong  Yes
  12   Overcast  Mild         High      Strong  Yes
  13   Overcast  Hot          Normal    Weak    Yes
  14   Rain      Mild         High      Strong  No

72

Naïve Bayes Example    v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 69: Generative Models; Naive Bayes

CIS 419/519 Fall’20 73

Estimating Probabilities
• v_NB = argmax_{v∈{yes,no}} P(v) ∏_i P(x_i = observation_i | v)
• How do we estimate P(observation_i | v)?

Page 70: Generative Models; Naive Bayes

CIS 419/519 Fall’20 74

Example
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)       (6 numbers)
• Compute P(temp = hot/mild/cool | PlayTennis = yes/no)   (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)      (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)            (4 numbers)

  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 71: Generative Models; Naive Bayes

CIS 419/519 Fall’20 75

Example
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)       (6 numbers)
• Compute P(temp = hot/mild/cool | PlayTennis = yes/no)   (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)      (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)            (4 numbers)

  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 72: Generative Models; Naive Bayes

CIS 419/519 Fall’20 76

Page 73: Generative Models; Naive Bayes

CIS 419/519 Fall’20 77

Example
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)       (6 numbers)
• Compute P(temp = hot/mild/cool | PlayTennis = yes/no)   (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)      (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)            (4 numbers)

• (1, 4, 4, 2, 2 independent numbers)

• Given a new instance:
  (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

• Predict: PlayTennis = ?

  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 74: Generative Models; Naive Bayes

CIS 419/519 Fall’20 78

Example
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36

• P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
• P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
• P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
• P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

• P(yes, …) ~ 0.0053                       P(no, …) ~ 0.0206

  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 75: Generative Models; Naive Bayes

CIS 419/519 Fall’20 79

Example
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36

• P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
• P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
• P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
• P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5

• P(yes, …) ~ 0.0053                       P(no, …) ~ 0.0206
• P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795

What if we were asked about Outlook = OC?

  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 76: Generative Models; Naive Bayes

CIS 419/519 Fall’20

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  1    Sunny     Hot          High      Weak    No
  2    Sunny     Hot          High      Strong  No
  3    Overcast  Hot          High      Weak    Yes
  4    Rain      Mild         High      Weak    Yes
  5    Rain      Cool         Normal    Weak    Yes
  6    Rain      Cool         Normal    Strong  No
  7    Overcast  Cool         Normal    Strong  Yes
  8    Sunny     Mild         High      Weak    No
  9    Sunny     Cool         Normal    Weak    Yes
  10   Rain      Mild         Normal    Weak    Yes
  11   Sunny     Mild         Normal    Strong  Yes
  12   Overcast  Mild         High      Strong  Yes
  13   Overcast  Hot          Normal    Weak    Yes
  14   Rain      Mild         High      Strong  No

80

Naïve Bayes Example    v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 77: Generative Models; Naive Bayes

CIS 419/519 Fall’20 81

Estimating Probabilities

  v_NB = argmax_v ∏_i P(x_i | v) P(v)

• How do we estimate P(x_i | v)?
• As we suggested before, we made a Binomial assumption; then:

  P(x_i = 1 | v) = #(x_i = 1 in training examples labeled v) / #(v-labeled examples) = n_i / n

• Sparsity of data is a problem
  – if n is small, the estimate is not accurate
  – if n_i is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data

Page 78: Generative Models; Naive Bayes

CIS 419/519 Fall’20 82

Robust Estimation of Probabilities

  v_NB = argmax_{v∈{like,dislike}} P(v) ∏_i P(x_i = word_i | v)

• This process is called smoothing.
• There are many ways to do it, some better justified than others;
• An empirical issue.

  P(x_k | v) = (n_k + m p) / (n + m)

Here:
• n_k is # of occurrences of the k-th feature given the label v
• n is # of occurrences of the label v
• p is a prior estimate of v (e.g., uniform)
• m is the equivalent sample size (# of labels)
  – Is this a reasonable definition?

Page 79: Generative Models; Naive Bayes

CIS 419/519 Fall’20 83

Robust Estimation of Probabilities
• Smoothing:

  P(x_k | v) = (n_k + m p) / (n + m)

• Common values:
  – Laplace Rule for the Boolean case: p = 1/2, m = 2

  P(x_k | v) = (n_k + 1) / (n + 2)

Page 80: Generative Models; Naive Bayes

CIS 419/519 Fall’20

• In practice, we will typically work in the log space:

  v_NB = argmax log P(v_j) ∏_i P(x_i | v_j)
       = argmax [ log P(v_j) + ∑_i log P(x_i | v_j) ]

85

Another comment on estimation
  v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)

Page 81: Generative Models; Naive Bayes

CIS 419/519 Fall’20 86

NaΓ―ve Bayes: Two Classes

β€’ Notice that the naΓ―ve Bayes method gives a method for predicting, rather than an explicit classifier.
β€’ In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

Why do things work?

[ P(v_j = 1) Β· ∏_{i=1}^n P(x_i | v_j = 1) ] / [ P(v_j = 0) Β· ∏_{i=1}^n P(x_i | v_j = 0) ] > 1

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Page 82: Generative Models; Naive Bayes

CIS 419/519 Fall’20 87

NaΓ―ve Bayes: Two Classes

β€’ Notice that the naΓ―ve Bayes method gives a method for predicting rather than an explicit classifier.

β€’ In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

[ P(v_j = 1) Β· ∏_{i=1}^n P(x_i | v_j = 1) ] / [ P(v_j = 0) Β· ∏_{i=1}^n P(x_i | v_j = 0) ] > 1

Denote p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0). Then this becomes:

[ P(v_j = 1) Β· ∏_{i=1}^n p_i^{x_i} (1 βˆ’ p_i)^{1βˆ’x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n q_i^{x_i} (1 βˆ’ q_i)^{1βˆ’x_i} ] > 1

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Page 83: Generative Models; Naive Bayes

CIS 419/519 Fall’20 88

NaΓ―ve Bayes: Two Classes
β€’ In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

[ P(v_j = 1) Β· ∏_{i=1}^n p_i^{x_i} (1 βˆ’ p_i)^{1βˆ’x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n q_i^{x_i} (1 βˆ’ q_i)^{1βˆ’x_i} ]
= [ P(v_j = 1) Β· ∏_{i=1}^n (1 βˆ’ p_i) Β· (p_i / (1 βˆ’ p_i))^{x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n (1 βˆ’ q_i) Β· (q_i / (1 βˆ’ q_i))^{x_i} ] > 1

Page 84: Generative Models; Naive Bayes

CIS 419/519 Fall’20

[ P(v_j = 1) Β· ∏_{i=1}^n p_i^{x_i} (1 βˆ’ p_i)^{1βˆ’x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n q_i^{x_i} (1 βˆ’ q_i)^{1βˆ’x_i} ]
= [ P(v_j = 1) Β· ∏_{i=1}^n (1 βˆ’ p_i) Β· (p_i / (1 βˆ’ p_i))^{x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n (1 βˆ’ q_i) Β· (q_i / (1 βˆ’ q_i))^{x_i} ] > 1

89

NaΓ―ve Bayes: Two Classes
β€’ In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

Take logarithm; we predict 𝑣𝑣 = 1 iff

π‘”π‘”π΅π΅π‘Žπ‘Žπ‘ƒπ‘ƒ 𝑣𝑣𝑗𝑗 = 1𝑃𝑃(𝑣𝑣𝑗𝑗 = 0) + οΏ½

𝑖𝑖

log1 βˆ’ 𝑝𝑝𝑖𝑖1 βˆ’ 𝑒𝑒𝑖𝑖

+ �𝑖𝑖

(log𝑝𝑝𝑖𝑖

1 βˆ’ π‘π‘π‘–π‘–βˆ’ log

𝑒𝑒𝑖𝑖1 βˆ’ 𝑒𝑒𝑖𝑖

)π‘Žπ‘Žπ‘–π‘– > 0

What is the decision function we derived?

Page 85: Generative Models; Naive Bayes

CIS 419/519 Fall’20 90

NaΓ―ve Bayes: Two Classes
β€’ In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:
β€’ We get that naΓ―ve Bayes is a linear separator, with
β€’ If p_i = q_i then w_i = 0 and the feature is irrelevant

[ P(v_j = 1) Β· ∏_{i=1}^n p_i^{x_i} (1 βˆ’ p_i)^{1βˆ’x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n q_i^{x_i} (1 βˆ’ q_i)^{1βˆ’x_i} ]
= [ P(v_j = 1) Β· ∏_{i=1}^n (1 βˆ’ p_i) Β· (p_i / (1 βˆ’ p_i))^{x_i} ] / [ P(v_j = 0) Β· ∏_{i=1}^n (1 βˆ’ q_i) Β· (q_i / (1 βˆ’ q_i))^{x_i} ] > 1

Take logarithm; we predict v = 1 iff

log [ P(v_j = 1) / P(v_j = 0) ] + Ξ£_i log [ (1 βˆ’ p_i) / (1 βˆ’ q_i) ] + Ξ£_i ( log [ p_i / (1 βˆ’ p_i) ] βˆ’ log [ q_i / (1 βˆ’ q_i) ] ) x_i > 0

w_i = log [ p_i / (1 βˆ’ p_i) ] βˆ’ log [ q_i / (1 βˆ’ q_i) ] = log [ p_i (1 βˆ’ q_i) / (q_i (1 βˆ’ p_i)) ]
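A small numeric check of this linear-separator view; the p_i, q_i values and priors below are made up for illustration:

```python
import math

# Hypothetical per-feature probabilities: p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0)
p = [0.8, 0.3, 0.6]
q = [0.4, 0.5, 0.1]
prior1, prior0 = 0.5, 0.5
x = [1, 0, 1]

# Linear form: weights w_i = log[p_i (1 - q_i) / (q_i (1 - p_i))] plus a bias term
w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
bias = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
linear_score = bias + sum(wi * xi for wi, xi in zip(w, x))

# Direct ratio of the two class scores P(v) * prod_i P(x_i | v)
def score(probs, prior):
    return prior * math.prod(pi if xi else (1 - pi) for pi, xi in zip(probs, x))

ratio = score(p, prior1) / score(q, prior0)
print(linear_score > 0, ratio > 1, math.isclose(linear_score, math.log(ratio)))   # True True True
```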

Page 86: Generative Models; Naive Bayes

CIS 419/519 Fall’20 91

NaΓ―ve Bayes: Two Classes
β€’ In the case of two classes we have that:

log [ P(v_j = 1 | x) / P(v_j = 0 | x) ] = Ξ£_i w_i x_i βˆ’ b

β€’ but since
P(v_j = 1 | x) = 1 βˆ’ P(v_j = 0 | x)

β€’ We get:

P(v_j = 1 | x) = 1 / (1 + exp(βˆ’Ξ£_i w_i x_i + b))

β€’ Which is simply the logistic function.

β€’ The linearity of NB provides a better explanation for why it works.

We have: A = 1 βˆ’ B and log(B/A) = βˆ’C. Then:
exp(βˆ’C) = B/A = (1 βˆ’ A)/A = 1/A βˆ’ 1, so 1 + exp(βˆ’C) = 1/A, and therefore A = 1 / (1 + exp(βˆ’C)).

Page 87: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Why Does it Work?

β€’ Learning Theory – Probabilistic predictions are done via Linear Models

– The low expressivity explains Generalization + Robustness

β€’ Expressivity (1: Methodology) – Is there a reason to believe that these hypotheses minimize the empirical error?

β€’ In general, no (unless some probabilistic assumptions happen to hold).
  – But: if the hypothesis does not fit the training data, experimenters will augment the set of features (in fact, add dependencies)
  – But now, this actually follows the Learning Theory protocol:
    β€’ Try to learn a hypothesis that is consistent with the data
    β€’ Generalization will be a function of the low expressivity

β€’ Expressivity (2: Theory)
  – (a) Product distributions are β€œdense” in the space of all distributions. Consequently, the resulting predictor’s error is actually close to that of the optimal classifier.
  – (b) NB classifiers cannot express all linear functions; but they converge faster to what they can express.

92

πΈπΈπ‘Žπ‘Žπ‘Žπ‘Žπ‘¦π‘¦ β„Ž =π‘Žπ‘Ž ∈ 𝑆𝑆 β„Ž π‘Žπ‘Ž β‰  1 |

|𝑆𝑆|

Page 88: Generative Models; Naive Bayes

CIS 419/519 Fall’20

What’s Next?
(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  – Yes; shown earlier
(2) If probabilistic hypotheses are actually like other linear functions, can we train them similarly (that is, discriminatively)?
  – Yes.
  – Classification: Logistic Regression / Max Entropy
  – HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models class)

93

Page 89: Generative Models; Naive Bayes

CIS 419/519 Fall’20 94

Recall: NaΓ―ve Bayes, Two Classes
β€’ In the case of two classes we have:

log [ P(v_j = 1 | x) / P(v_j = 0 | x) ] = Ξ£_i w_i x_i βˆ’ b

β€’ but since
P(v_j = 1 | x) = 1 βˆ’ P(v_j = 0 | x)

β€’ We showed that:

P(v_j = 1 | x) = 1 / (1 + exp(βˆ’Ξ£_i w_i x_i + b))

β€’ Which is simply the logistic (sigmoid) function used in the neural network representation.

Page 90: Generative Models; Naive Bayes

CIS 419/519 Fall’20 95

What’s Next (1)
β€’ (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  – Yes
β€’ General recipe:
  – Train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc.). Then:
  – Use the sigmoid 1/(1 + exp{βˆ’(A wα΅€x + B)}) to get an estimate for P(y | x)
  – A, B can be tuned using a held-out set that was not used for training.
  – Done in LBJava, for example
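A rough sketch of this recipe (sigmoid calibration, often called Platt scaling). The scores, labels, and the plain gradient-descent loop are mine, not from the slides:

```python
import math

# Held-out raw scores w^T x from an already-trained linear classifier, with their gold labels.
scores = [2.1, 1.3, 0.4, -0.2, -1.5, -2.3]       # synthetic
labels = [1, 1, 1, 0, 0, 0]

A, B, lr = 1.0, 0.0, 0.1
for _ in range(2000):                            # minimize log-loss of sigmoid(A*s + B) over (A, B)
    gA = gB = 0.0
    for s, y in zip(scores, labels):
        prob = 1.0 / (1.0 + math.exp(-(A * s + B)))
        gA += (prob - y) * s                     # d(log-loss)/dA
        gB += (prob - y)                         # d(log-loss)/dB
    A -= lr * gA
    B -= lr * gB

calibrated = lambda s: 1.0 / (1.0 + math.exp(-(A * s + B)))
print(round(calibrated(1.0), 3), round(calibrated(-1.0), 3))   # probability estimates for new scores
```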

Page 91: Generative Models; Naive Bayes

CIS 419/519 Fall’20

What’s Next (2): Logistic Regression
β€’ (2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)?
β€’ The logistic regression model assumes the following model:

P(y = +1 | x, w) = [1 + exp(βˆ’y(wα΅€x + b))]⁻¹

β€’ This is the same model we derived for naΓ―ve Bayes, only that now we will not make any independence assumption; we will directly find the best w.
β€’ Therefore training will be more difficult. However, the weight vector derived will be more expressive.
  – It can be shown that the naΓ―ve Bayes algorithm cannot represent all linear threshold functions.
  – On the other hand, NB converges to its performance faster.

96

How?

Page 92: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Logistic Regression (2)
β€’ Given the model:

P(y = +1 | x, w) = [1 + exp(βˆ’y(wα΅€x + b))]⁻¹

β€’ The goal is to find the (w, b) that maximizes the log-likelihood of the data {x_1, x_2, …, x_m}.
β€’ Equivalently, we are looking for the (w, b) that minimizes the negative log-likelihood:

min_{w,b} Ξ£_{i=1}^m βˆ’log P(y_i | x_i, w) = min_{w,b} Ξ£_{i=1}^m log[1 + exp(βˆ’y_i(wα΅€x_i + b))]

β€’ This optimization problem is called Logistic Regression.

β€’ Logistic Regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those consistent with the observed feature statistics).

97

Page 93: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Logistic Regression (3)
β€’ Using the standard mapping to linear separators through the origin, we would like to minimize:

min_{w,b} Ξ£_{i=1}^m βˆ’log P(y_i | x_i, w) = min_{w,b} Ξ£_{i=1}^m log[1 + exp(βˆ’y_i(wα΅€x_i + b))]

β€’ To get good generalization, it is common to add a regularization term, and the regularized logistic regression then becomes:

min_w f(w) = Β½ wα΅€w + C Ξ£_{i=1}^m log[1 + exp(βˆ’y_i wα΅€x_i)],

where C is a user-selected parameter that balances the two terms.

98

(Β½ wα΅€w is the regularization term; the sum is the empirical loss.)
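A minimal sketch of minimizing this regularized objective with plain gradient descent; the toy data, learning rate, and C are arbitrary choices, not part of the slides:

```python
import math

# Toy 2D data with labels y in {-1, +1}
data = [((1.0, 2.0), +1), ((2.0, 1.5), +1), ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]
C, lr = 1.0, 0.1
w = [0.0, 0.0]

def objective(w):
    reg = 0.5 * sum(wi * wi for wi in w)                       # 1/2 w^T w
    loss = sum(math.log(1 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x))))
               for x, y in data)                               # sum of logistic losses
    return reg + C * loss

for _ in range(500):
    grad = list(w)                                             # gradient of the regularizer
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coef = -y / (1 + math.exp(margin))                     # derivative of log(1 + exp(-y w^T x))
        for i, xi in enumerate(x):
            grad[i] += C * coef * xi
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

print([round(wi, 3) for wi in w], round(objective(w), 4))
```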

Page 94: Generative Models; Naive Bayes

CIS 419/519 Fall’20

Comments on Discriminative Learning
β€’ min_w f(w) = Β½ wα΅€w + C Ξ£_{i=1}^m log[1 + exp(βˆ’y_i wα΅€x_i)],
  where C is a user-selected parameter that balances the two terms.
β€’ Since the second term is a loss function, regularized logistic regression can be related to other learning methods, e.g., SVMs.
β€’ L1 SVM solves the following optimization problem:
  min_w f_1(w) = Β½ wα΅€w + C Ξ£_{i=1}^m max(0, 1 βˆ’ y_i wα΅€x_i)
β€’ L2 SVM solves the following optimization problem:
  min_w f_2(w) = Β½ wα΅€w + C Ξ£_{i=1}^m max(0, 1 βˆ’ y_i wα΅€x_i)Β²

99

(Β½ wα΅€w is the regularization term; the sum is the empirical loss.)
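To make the connection concrete, a tiny comparison of the three per-example losses as a function of the margin m = y wα΅€x (an illustration of mine, not from the slides):

```python
import math

def logistic_loss(m): return math.log(1 + math.exp(-m))   # regularized logistic regression
def l1_hinge(m):      return max(0.0, 1 - m)              # L1 SVM
def l2_hinge(m):      return max(0.0, 1 - m) ** 2         # L2 SVM

for m in (-2, 0, 1, 2):                                   # margin y * w^T x
    print(m, round(logistic_loss(m), 3), l1_hinge(m), l2_hinge(m))
# All three penalize small or negative margins and (nearly) vanish for large positive margins.
```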

Page 95: Generative Models; Naive Bayes

CIS 419/519 Fall’20 100

A few more NB examples

Page 96: Generative Models; Naive Bayes

CIS 419/519 Fall’20 101

Example: Learning to Classify Text
β€’ Instance space X: text documents
β€’ Instances are labeled according to f(x) = like/dislike
β€’ Goal: learn this function such that, given a new document, you can use it to decide if you like it or not
β€’ How to represent the document?
β€’ How to estimate the probabilities?
β€’ How to classify?

𝑣𝑣𝑁𝑁𝑁𝑁 = π‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘Žπ‘£π‘£βˆˆπ‘‰π‘‰π‘ƒπ‘ƒ 𝑣𝑣 �𝑖𝑖

𝑃𝑃(π‘Žπ‘Žπ‘–π‘–|𝑣𝑣)

Page 97: Generative Models; Naive Bayes

CIS 419/519 Fall’20 102

Document Representation
β€’ Instance space X: text documents
β€’ Instances are labeled according to y = f(x) = like/dislike
β€’ How to represent the document?
β€’ A document will be represented as a list of its words
β€’ The representation question can be viewed as the generation question
β€’ We have a dictionary of n words (therefore 2n parameters)
β€’ We have documents of size N: we could account for word position & count
β€’ Having a parameter for each word & position may be too much:
  – # of parameters: 2 Γ— N Γ— n (2 Γ— 100 Γ— 50,000 ~ 10⁷)
β€’ Simplifying Assumption:
  – The probability of observing a word in a document is independent of its location
  – This still allows us to think about two ways of generating the document

Page 98: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ We want to compute
  argmax_y P(y | D) = argmax_y P(D | y) P(y) / P(D) = argmax_y P(D | y) P(y)
β€’ Our assumptions will go into estimating P(D | y):
1. Multivariate Bernoulli
  I. To generate a document, first decide if it’s good (y = 1) or bad (y = 0).
  II. Given that, consider your dictionary of words and choose word w into your document with probability p(w | y), irrespective of anything else.
  III. If the size of the dictionary is |V| = n, we can then write

P(d | y) = ∏_{i=1}^n P(w_i = 1 | y)^{b_i} P(w_i = 0 | y)^{1βˆ’b_i}

β€’ Where:
  – p(w = 1/0 | y): the probability that w does / does not appear in a y-labeled document
  – b_i ∈ {0,1} indicates whether word w_i occurs in document d
  – 2n + 2 parameters
  – Estimating P(w_i = 1 | y) and P(y) is done in the ML way as before (counting).

103

Classification via Bayes Rule (B)
Parameters:
1. Priors: P(y = 0/1)
2. βˆ€ w_i ∈ Dictionary: p(w_i = 0/1 | y = 0/1)
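A small sketch of the Bernoulli document likelihood P(d | y); the tiny dictionary and probability values are illustrative, not from the slides:

```python
# Multivariate Bernoulli document model: one indicator b_i per dictionary word.
vocab = ["ball", "game", "election", "vote"]          # |V| = n = 4, hypothetical
p_word_given_y = {                                    # P(w_i = 1 | y)
    1: [0.8, 0.7, 0.1, 0.1],                          # y = 1: "sports-like" documents
    0: [0.1, 0.2, 0.7, 0.6],                          # y = 0
}

def bernoulli_likelihood(doc_words, y):
    """P(d | y) = prod_i P(w_i = 1 | y)^{b_i} * P(w_i = 0 | y)^{1 - b_i}."""
    prob = 1.0
    for i, w in enumerate(vocab):
        b_i = 1 if w in doc_words else 0
        p_i = p_word_given_y[y][i]
        prob *= p_i if b_i else (1 - p_i)
    return prob

doc = {"ball", "game"}
print(bernoulli_likelihood(doc, 1), bernoulli_likelihood(doc, 0))   # higher under y = 1
```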

Page 99: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ We want to compute
  argmax_y P(y | D) = argmax_y P(D | y) P(y) / P(D) = argmax_y P(D | y) P(y)
β€’ Our assumptions will go into estimating P(D | y):
2. Multinomial
  I. To generate a document, first decide if it’s good (y = 1) or bad (y = 0).
  II. Given that, place N words into d, such that w_i is placed with probability P(w_i | y), and Ξ£_i P(w_i | y) = 1.
  III. The probability of a document is:

P(d | y) = N! / (n_1! … n_k!) Β· P(w_1 | y)^{n_1} … P(w_k | y)^{n_k}

β€’ Where n_i is the # of times w_i appears in the document.
β€’ Same # of parameters: 2n + 2, where n = |Dictionary|, but the estimation is done a bit differently. (HW)

104

A Multinomial Model
Parameters:
1. Priors: P(y = 0/1)
2. βˆ€ w_i ∈ Dictionary: p(w_i | y = 0/1)
N dictionary items are chosen into D
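A matching sketch for the multinomial likelihood, again with made-up probabilities. The multinomial coefficient is included for completeness, though it is the same for both labels and cancels in the comparison:

```python
import math
from collections import Counter

vocab = ["ball", "game", "election", "vote"]
p_word_given_y = {1: [0.5, 0.3, 0.1, 0.1],            # P(w_i | y), sums to 1 per label
                  0: [0.1, 0.1, 0.4, 0.4]}

def multinomial_likelihood(doc_tokens, y):
    """P(d | y) = N!/(n_1!...n_k!) * prod_i P(w_i | y)^{n_i}."""
    counts = Counter(doc_tokens)
    N = len(doc_tokens)
    coef = math.factorial(N)
    prob = 1.0
    for i, w in enumerate(vocab):
        n_i = counts[w]
        coef //= math.factorial(n_i)
        prob *= p_word_given_y[y][i] ** n_i
    return coef * prob

doc = ["ball", "ball", "game", "vote"]
print(multinomial_likelihood(doc, 1), multinomial_likelihood(doc, 0))   # higher under y = 1
```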

Page 100: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ The generative model in these two cases is different

[Plate diagrams: a prior ΞΌ generates the document label; a per-word parameter generates the word variables; plates range over documents d and over dictionary words w (Bernoulli) or positions p (multinomial).]

Bernoulli: a binary variable corresponds to a document d and a dictionary word w, and takes the value 1 if w appears in d. The document's topic/label is governed by the prior ΞΌ, and the variable in the intersection of the plates is governed by the label and by the Bernoulli parameter for the dictionary word w.

Multinomial: variables do not correspond to dictionary words but to positions (occurrences) in the document d. The internal variable is then W(D, P). These variables are generated from the same multinomial distribution and depend on the topic/label.

105

Model Representation

Page 101: Generative Models; Naive Bayes

CIS 419/519 Fall’20 106

General NB Scenario
β€’ We assume a mixture probability model, parameterized by ΞΌ.
β€’ Different components {c_1, c_2, …, c_k} of the model are parameterized by disjoint subsets of ΞΌ.
β€’ The generative story: a document d is created by (1) selecting a component according to the priors, P(c_j | ΞΌ), and then (2) having that mixture component generate a document according to its own parameters, with distribution P(d | c_j, ΞΌ).
β€’ So we have:

P(d | ΞΌ) = Ξ£_{j=1}^k P(c_j | ΞΌ) P(d | c_j, ΞΌ)

β€’ In the case of document classification, we assume a one-to-one correspondence between components and labels.

Page 102: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ X_i can be continuous
β€’ We can still use

P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)

β€’ And

P(Y = y | X_1, …, X_n) = P(Y = y) ∏_i P(X_i | Y = y) / Ξ£_j P(Y = y_j) ∏_i P(X_i | Y = y_j)

107

NaΓ―ve Bayes: Continuous Features

Page 103: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ X_i can be continuous
β€’ We can still use

P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)

β€’ And

P(Y = y | X_1, …, X_n) = P(Y = y) ∏_i P(X_i | Y = y) / Ξ£_j P(Y = y_j) ∏_i P(X_i | Y = y_j)

NaΓ―ve Bayes classifier:

Y = argmax_y P(Y = y) ∏_i P(X_i | Y = y)

108

NaΓ―ve Bayes: Continuous Features

Page 104: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ X_i can be continuous
β€’ We can still use

P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)

β€’ And

P(Y = y | X_1, …, X_n) = P(Y = y) ∏_i P(X_i | Y = y) / Ξ£_j P(Y = y_j) ∏_i P(X_i | Y = y_j)

NaΓ―ve Bayes classifier:

Y = argmax_y P(Y = y) ∏_i P(X_i | Y = y)

Assumption: P(X_i | Y) has a Gaussian distribution

109

NaΓ―ve Bayes: Continuous Features

Page 105: Generative Models; Naive Bayes

CIS 419/519 Fall’20 110

The Gaussian Probability Distribution
β€’ The Gaussian probability distribution is also called the normal distribution.
β€’ It is a continuous distribution with pdf:

p(x) = (1 / (Οƒβˆš(2Ο€))) Β· exp(βˆ’(x βˆ’ ΞΌ)Β² / (2σ²))

β€’ ΞΌ = mean of the distribution
β€’ σ² = variance of the distribution
β€’ x is a continuous variable (βˆ’βˆž ≀ x ≀ ∞)
β€’ The probability of x being in the range [a, b] cannot be evaluated analytically (it has to be looked up in a table)

[Figure: plot of the Gaussian pdf p(x) versus x.]

Page 106: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ 𝑃𝑃(𝑋𝑋𝑖𝑖|π‘Œπ‘Œ) is Gaussianβ€’ Training: estimate mean and standard deviation

– πœ‡πœ‡π‘–π‘– = 𝐸𝐸 𝑋𝑋𝑖𝑖 π‘Œπ‘Œ = 𝑦𝑦– πœŽπœŽπ‘–π‘–2 = 𝐸𝐸 𝑋𝑋𝑖𝑖 βˆ’ πœ‡πœ‡π‘–π‘– 2 π‘Œπ‘Œ = 𝑦𝑦

111

NaΓ―ve Bayes: Continuous Features

Note that the following slides abuse notation significantly. Since P(x) = 0 for continuous distributions, we think of P(X = x | Y = y) not as a classic probability distribution, but just as a function f(x) = N(x, ΞΌ, σ²). f(x) behaves as a probability density in the sense that βˆ€x, f(x) β‰₯ 0 and it integrates to 1. Also, note that f(x) satisfies Bayes rule; that is, it is true that:

f_Y(y | X = x) = f_X(x | Y = y) f_Y(y) / f_X(x)

Page 107: Generative Models; Naive Bayes

CIS 419/519 Fall’20

β€’ 𝑃𝑃(𝑋𝑋𝑖𝑖|π‘Œπ‘Œ) is Gaussianβ€’ Training: estimate mean and standard deviation

– πœ‡πœ‡π‘–π‘– = 𝐸𝐸 𝑋𝑋𝑖𝑖 π‘Œπ‘Œ = 𝑦𝑦– πœŽπœŽπ‘–π‘–2 = 𝐸𝐸 𝑋𝑋𝑖𝑖 βˆ’ πœ‡πœ‡π‘–π‘– 2 π‘Œπ‘Œ = 𝑦𝑦

112

NaΓ―ve Bayes: Continuous Features

X_1    X_2   X_3   Y
2      3     1     1
βˆ’1.2   2     0.4   1
1.2    0.3   0     0
2.2    1.1   0     1

Page 108: Generative Models; Naive Bayes

CIS 419/519 Fall’20 113

NaΓ―ve Bayes: Continuous Features
β€’ P(X_i | Y) is Gaussian
β€’ Training: estimate mean and standard deviation
  – ΞΌ_i = E[X_i | Y = y]
  – Οƒ_iΒ² = E[(X_i βˆ’ ΞΌ_i)Β² | Y = y]

  – ΞΌ_1 = E[X_1 | Y = 1] = (2 + (βˆ’1.2) + 2.2) / 3 = 1
  – Οƒ_1Β² = E[(X_1 βˆ’ ΞΌ_1)Β² | Y = 1] = ((2 βˆ’ 1)Β² + (βˆ’1.2 βˆ’ 1)Β² + (2.2 βˆ’ 1)Β²) / 3 β‰ˆ 2.43

X_1    X_2   X_3   Y
2      3     1     1
βˆ’1.2   2     0.4   1
1.2    0.3   0     0
2.2    1.1   0     1
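A short sketch reproducing the ΞΌ_1 and Οƒ_1Β² computation above from the small table (per-class maximum-likelihood estimates):

```python
# Rows of the table: (X1, X2, X3, Y)
rows = [(2, 3, 1, 1), (-1.2, 2, 0.4, 1), (1.2, 0.3, 0, 0), (2.2, 1.1, 0, 1)]

def gaussian_params(feature_index, label):
    """ML estimates of the class-conditional mean and variance for one feature."""
    values = [r[feature_index] for r in rows if r[-1] == label]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

mu1, var1 = gaussian_params(0, 1)
print(mu1, round(var1, 2))   # 1.0 2.43, matching the slide
```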

Page 109: Generative Models; Naive Bayes

CIS 419/519 Fall’20 114

Recall: NaΓ―ve Bayes, Two Classes
β€’ In the case of two classes we have that:

log [ P(v = 1 | x) / P(v = 0 | x) ] = Ξ£_i w_i x_i βˆ’ b

β€’ but since
P(v = 1 | x) = 1 βˆ’ P(v = 0 | x)

β€’ We get:

P(v = 1 | x) = 1 / (1 + exp(βˆ’Ξ£_i w_i x_i + b))

β€’ Which is simply the logistic function (also used in the neural network representation)
β€’ The same formula can be written for continuous features

Page 110: Generative Models; Naive Bayes

CIS 419/519 Fall’20

𝑃𝑃 𝑣𝑣 = 1 π‘Žπ‘Ž =1

1 + exp(log𝑃𝑃(𝑣𝑣 = 0|π‘Žπ‘Ž)𝑃𝑃(𝑣𝑣 = 1|π‘Žπ‘Ž))

=1

1 + exp(log𝑃𝑃 𝑣𝑣 = 0 𝑃𝑃(π‘Žπ‘Ž|𝑣𝑣 = 0)𝑃𝑃 𝑣𝑣 = 1 𝑃𝑃(π‘Žπ‘Ž|𝑣𝑣 = 1))

=1

1 + exp(log𝑃𝑃 𝑣𝑣 = 0𝑃𝑃(𝑣𝑣 = 1) + βˆ‘π‘–π‘– log𝑃𝑃(π‘Žπ‘Žπ‘–π‘–|𝑣𝑣 = 0)

𝑃𝑃(π‘Žπ‘Žπ‘–π‘–|𝑣𝑣 = 1))

�𝑖𝑖

log𝑃𝑃(π‘Žπ‘Žπ‘–π‘–|𝑣𝑣 = 0)𝑃𝑃(π‘Žπ‘Žπ‘–π‘–|𝑣𝑣 = 1) =οΏ½

𝑖𝑖

π‘”π‘”π΅π΅π‘Žπ‘Ž

1

2πœ‹πœ‹πœŽπœŽπ‘–π‘–2exp βˆ’ π‘Žπ‘Žπ‘–π‘– βˆ’ πœ‡πœ‡π‘–π‘–0 2

2πœŽπœŽπ‘–π‘–2

1

2πœ‹πœ‹πœŽπœŽπ‘–π‘–2exp βˆ’ π‘Žπ‘Žπ‘–π‘– βˆ’ πœ‡πœ‡π‘–π‘–1 2

2πœŽπœŽπ‘–π‘–2

= �𝑖𝑖

log exp(π‘Žπ‘Žπ‘–π‘– βˆ’ πœ‡πœ‡π‘–π‘–1 2 βˆ’ π‘Žπ‘Žπ‘–π‘– βˆ’ πœ‡πœ‡π‘–π‘–0 2

2 πœŽπœŽπ‘–π‘–2)

= �𝑖𝑖

(πœ‡πœ‡π‘–π‘–0 βˆ’ πœ‡πœ‡π‘–π‘–1

πœŽπœŽπ‘–π‘–2π‘Žπ‘Žπ‘–π‘– +

πœ‡πœ‡π‘–π‘–12 βˆ’ πœ‡πœ‡π‘–π‘–02

2πœŽπœŽπ‘–π‘–2)

β€’ Logistic function for Gaussian features

115

Logistic Function: Continuous Features

Note that we are using a ratio of probabilities, since x is a continuous variable.
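A small numeric check of this derivation, with arbitrary means, per-feature variances shared across the two classes, and priors: the Gaussian NB posterior equals a logistic function of a score that is linear in x.

```python
import math

# Per-feature class-conditional Gaussians with a shared variance per feature (the derivation's assumption)
mu0, mu1, var = [0.0, 1.0], [2.0, -1.0], [1.0, 0.5]
prior0, prior1 = 0.5, 0.5
x = [1.3, 0.2]

def gauss(v, mu, s2):
    return math.exp(-(v - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

# Direct NB posterior P(v = 1 | x)
num1 = prior1 * math.prod(gauss(xi, m, s) for xi, m, s in zip(x, mu1, var))
num0 = prior0 * math.prod(gauss(xi, m, s) for xi, m, s in zip(x, mu0, var))
posterior = num1 / (num0 + num1)

# Logistic form: 1 / (1 + exp(log(P0/P1) + sum_i [((mu_i0 - mu_i1)/s_i^2) x_i + (mu_i1^2 - mu_i0^2)/(2 s_i^2)]))
z = math.log(prior0 / prior1) + sum(((m0 - m1) / s) * xi + (m1 ** 2 - m0 ** 2) / (2 * s)
                                    for xi, m0, m1, s in zip(x, mu0, mu1, var))
print(round(posterior, 6), round(1 / (1 + math.exp(z)), 6))   # identical up to rounding
```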