CIS 419/519 Fall'20

Generative Models; Naive Bayes

Dan Roth
[email protected] | http://www.cis.upenn.edu/~danroth/ | 461C, 3401 Walnut

Slides were created by Dan Roth (for CIS 519/419 at Penn or CS 446 at UIUC), and by other authors who have made their ML slides available.

CIS 419/519 Fall'20 2

Administration (11/23/20)
• Remember that all the lectures are available on the website before the class
  – Go over them and be prepared
  – A new set of written notes will accompany most lectures, with some more details, examples, and (when relevant) some code.
• HW4 is out – NNs and Bayesian Learning
  – Due 12/7 (Monday; moved from 12/3)
• No lecture on Wednesday (Friday schedule) – Happy Thanksgiving!
• Projects
  – Most of you have chosen a project and a team.
  – You should have received an email from one of the TAs, who will help guide your project.
• We have dates for the Final and the Project Submission; check the class website.

Are we recording? YES! Available on the website.

CIS 419/519 Fall'20 3

Projects
• CIS 519 students need to do a team project: read the project descriptions and follow the updates on the Project webpage.
  – Teams will be of size 2-4
  – We will help with grouping if needed
• There will be 3 options for projects:
  – Natural Language Processing (Text)
  – Computer Vision (Images)
  – Speech (Audio)
• In all cases, we will give you datasets and initial ideas
  – The problems will be multiclass classification problems
  – You will get annotated data only for some of the labels, but will also have to predict other labels
  – Zero-shot learning; few-shot learning; transfer learning
• A detailed note will come out today.
• Timeline:
  – 11/11 Choose a project and team up
  – 11/23 Initial proposal describing what your team plans to do
  – 12/2 Progress report
  – 12/15-20 (TBD) Final paper + short video
• Try to make it interesting!

CIS 419/519 Fall'20

Recap: Error Driven (Discriminative) Learning
• Consider a distribution D over space X×Y
• X is the instance space; Y is the set of labels (e.g., ±1)
• We can think of the data generation process as governed by D(x), and the labeling process as governed by D(y|x), such that
      D(x, y) = D(x) D(y|x)
• This can be used to model both the case where labels are generated by a function y = f(x), as well as noisy cases and probabilistic generation of the label.
• If the distribution D is known, there is no learning. We can simply predict y = argmax_y D(y|x).
• If we are looking for a hypothesis, we can simply find the one that minimizes the probability of mislabeling:
      h = argmin_h E_{(x,y)~D} [[h(x) ≠ y]]

CIS 419/519 Fall'20 5

Recap: Error Driven Learning (2)
• Inductive learning comes into play when the distribution is not known.
• Then, there are two basic approaches to take:
  – Discriminative (Direct) Learning, and
  – Bayesian Learning (Generative)
• Running example:
  – Text correction:
      "I saw the girl it the park" → I saw the girl in the park

CIS 419/519 Fall'20 6

1: Direct Learning
• Model the problem of text correction as a problem of learning from examples.
• Goal: learn directly how to make predictions.

PARADIGM
• Look at many (positive/negative) examples.
• Discover some regularities in the data.
• Use these to construct a prediction policy.
• A policy (a function, a predictor) needs to be specific.
      [it/in] rule: if "the" occurs after the target, choose "in".
• Assumptions come in the form of a hypothesis class.

Bottom line: approximating h : X → Y is estimating P(Y|X).
CIS 419/519 Fall'20 7

Direct Learning (2)
• Consider a distribution D over space X×Y
• X is the instance space; Y is the set of labels (e.g., ±1)
• Given a sample {(x, y)}_1^m and a loss function L(h(x), y),
• find h ∈ H that minimizes
      ∑_{i=1..m} D(x_i, y_i) L(h(x_i), y_i) + Reg   (a regularization term)

L can be:
      L(h(x), y) = 1 if h(x) ≠ y, otherwise L(h(x), y) = 0   (0-1 loss)
      L(h(x), y) = (h(x) − y)^2                              (L2 loss)
      L(h(x), y) = max{0, 1 − y h(x)}                        (hinge loss)
      L(h(x), y) = exp{−y h(x)}                              (exponential loss)

• Guarantees: If we find an algorithm that minimizes loss on the observed data, then learning theory guarantees good future behavior (as a function of |H|).
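These losses are one-liners in code; a minimal sketch for a single real-valued prediction h(x) and a label y ∈ {−1, +1} (function names are just illustrative):

    import math

    def zero_one_loss(hx, y):
        return 0.0 if (1 if hx >= 0 else -1) == y else 1.0

    def l2_loss(hx, y):
        return (hx - y) ** 2

    def hinge_loss(hx, y):
        return max(0.0, 1.0 - y * hx)

    def exp_loss(hx, y):
        return math.exp(-y * hx)

    print(hinge_loss(0.3, +1), hinge_loss(0.3, -1))  # 0.7 (inside margin), 1.3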
CIS 419/519 Fall'20 8

2: Generative Model
• Model the problem of text correction as that of generating correct sentences.
• Goal: learn a model of the language; use it to predict.

PARADIGM
• Learn a probability distribution over all sentences
  – In practice: make assumptions on the distribution's type
• Use it to estimate which sentence is more likely:
  – Pr(I saw the girl it the park) ≷ Pr(I saw the girl in the park)
  – In practice: a decision policy depends on the assumptions
• Guarantees: provided we assume the "right" family of probability distributions

The model is called "generative" since it makes an assumption on how the data X is generated given y.

Bottom line: the generative paradigm approximates P(X, Y) = P(X|Y) P(Y).

CIS 419/519 Fall'20 9

Probabilistic Learning
• Note that there are two different notions.
• Learning probabilistic concepts
  – The learned concept is a function h: X → [0,1]
  – h(x) may be interpreted as the probability that the label 1 is assigned to x
  – The learning theory that we have studied before is applicable (with some extensions).
• Bayesian Learning: the use of a probabilistic criterion in selecting a hypothesis
  – The hypothesis can be deterministic, a Boolean function.
  – It's not the hypothesis, it's the process.
• In practice, as we'll see, we will use the same principles as before, often similar justifications, and similar algorithms.

Remind yourself of basic probability.

CIS 419/519 Fall'20

Probabilities
• 30 years of AI research danced around the fact that the world is inherently uncertain
• Bayesian Inference:
  – Use probability theory and information about independence
  – Reason diagnostically (from evidence (effects) to conclusions (causes))...
  – ...or causally (from causes to effects)
• Probabilistic reasoning only gives probabilistic results
  – i.e., it summarizes uncertainty from various sources
• We will only use it as a tool in Machine Learning.

CIS 419/519 Fall'20 11

Concepts
• Probability, Probability Space and Events
• Joint Events
• Conditional Probabilities
• Independence
  – This week's recitation will provide a refresher on probability
  – Use the material we provided on-line
• Next I will give a very quick refresher

CIS 419/519 Fall'20 12

(1) Discrete Random Variables
• Let X denote a random variable
  – X is a mapping from a space of outcomes to R.
  – [X = x] represents an event (a possible outcome), and
  – each event has an associated probability.
• Examples of binary random variables:
  – A = I have a headache
  – A = Sally will be the US president in 2020
• P(A = true) is "the fraction of possible worlds in which A is true"
  – We could spend hours on the philosophy of this, but we won't.

Adapted from slide by Andrew Moore

CIS 419/519 Fall'20 13

Visualizing A
• The universe U is the event space of all possible worlds
  – Its area is 1: P(U) = 1
• P(A) = the area of the red oval (the worlds in which A is true); the rest of U is the worlds in which A is false.
• Therefore:
      P(A) + P(¬A) = 1
      P(¬A) = 1 − P(A)

CIS 419/519 Fall'20 14

Axioms of Probability
• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – de Finetti, Cox, and Carnap have also provided compelling arguments for these axioms
1. All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:
      P(true) = 1 ; P(false) = 0
3. The probability of a disjunction is given by:
      P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

CIS 419/519 Fall'20 15

Interpreting the Axioms
• 0 ≤ P(A) ≤ 1
• P(true) = 1
• P(false) = 0
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• From these you can prove other properties:
      P(¬A) = 1 − P(A)
      P(A) = P(A ∧ B) + P(A ∧ ¬B)

CIS 419/519 Fall'20 16

Multi-valued Random Variables
• A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k}
• Think about tossing a die
• Thus:
  – P(A = v_i ∧ A = v_j) = 0 if i ≠ j
  – P(A = v_1 ∨ A = v_2 ∨ … ∨ A = v_k) = 1
  – 1 = ∑_{j=1}^{k} P(A = v_j)

Based on slide by Andrew Moore

CIS 419/519 Fall'20 17

Multi-valued Random Variables
• We can also show that:
      P(B) = P(B ∧ (A = v_1 ∨ A = v_2 ∨ ⋯ ∨ A = v_k))
      P(B) = ∑_{j=1}^{k} P(B ∧ A = v_j)
• This is called marginalization over A.

CIS 419/519 Fall'20 18

(2) Joint Probabilities
• Joint probability: a matrix of combined probabilities of a set of variables

Russell & Norvig's Alarm Domain: (Boolean RVs)
• A world has a specific instantiation of variables:
      (alarm ∧ burglary ∧ ¬earthquake)
• The joint probability is given by:

      P(alarm, burglary) =
                     alarm   ¬alarm
      burglary       0.09    0.01
      ¬burglary      0.1     0.8

• Probability of burglary:
      P(burglary) = 0.1
  by marginalization over alarm
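In code, marginalization is just a sum over the relevant rows of the joint table; a small sketch:

    # Recovering P(burglary) from the joint above by summing over 'alarm'.
    joint = {('alarm', 'burglary'): 0.09, ('alarm', 'no burglary'): 0.10,
             ('no alarm', 'burglary'): 0.01, ('no alarm', 'no burglary'): 0.80}

    p_burglary = sum(p for (a, b), p in joint.items() if b == 'burglary')
    print(p_burglary)  # 0.09 + 0.01 -> 0.1 (up to float rounding)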
CIS 419/519 Fall'20 19

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  (e.g., Boolean variables A, B, C)

Slide © Andrew Moore

CIS 419/519 Fall'20 20

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).

  (e.g., Boolean variables A, B, C)

      A B C
      0 0 0
      0 0 1
      0 1 0
      0 1 1
      1 0 0
      1 0 1
      1 1 0
      1 1 1

Slide © Andrew Moore

CIS 419/519 Fall'20 21

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).
  2. For each combination of values, say how probable it is.

  (e.g., Boolean variables A, B, C)

      A B C  Prob
      0 0 0  0.30
      0 0 1  0.05
      0 1 0  0.10
      0 1 1  0.05
      1 0 0  0.05
      1 0 1  0.10
      1 1 0  0.25
      1 1 1  0.10

Slide © Andrew Moore

CIS 419/519 Fall'20 22

The Joint Distribution
• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (if there are d Boolean variables then the table will have 2^d rows).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.

  (e.g., Boolean variables A, B, C)

      A B C  Prob
      0 0 0  0.30
      0 0 1  0.05
      0 1 0  0.10
      0 1 1  0.05
      1 0 0  0.05
      1 0 1  0.10
      1 1 0  0.25
      1 1 1  0.10

[Venn diagram of A, B, C partitioning the probabilities 0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10 across its regions]

Slide © Andrew Moore

CIS 419/519 Fall'20

Inferring Probabilities from the Joint

                       alarm                      ¬alarm
                earthquake  ¬earthquake    earthquake  ¬earthquake
      burglary     0.01        0.08          0.001        0.009
      ¬burglary    0.01        0.09          0.01         0.79

• P(alarm) = ∑_{b,e} P(alarm ∧ burglary = b ∧ earthquake = e)
           = 0.01 + 0.08 + 0.01 + 0.09 = 0.19
• P(burglary) = ∑_{a,e} P(alarm = a ∧ burglary ∧ earthquake = e)
              = 0.01 + 0.08 + 0.001 + 0.009 = 0.1

CIS 419/519 Fall'20 24

(3) Conditional Probability
• P(A | B) = the fraction of worlds in which B is true that also have A true

What if we already know that B is true? That knowledge changes the probability of A:
• because we know we're in a world where B is true.

      P(A | B) = P(A ∧ B) / P(B)
      P(A ∧ B) = P(A | B) × P(B)

CIS 419/519 Fall'20 25

Example: Conditional Probabilities

      P(alarm, burglary) =
                     alarm   ¬alarm
      burglary       0.09    0.01
      ¬burglary      0.1     0.8

      P(A | B) = P(A ∧ B) / P(B)
      P(A ∧ B) = P(A | B) × P(B)

• P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = 0.09 / 0.19 = 0.47
• P(alarm | burglary) = P(alarm ∧ burglary) / P(burglary) = 0.09 / 0.1 = 0.9
• P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = 0.47 × 0.19 = 0.09

CIS 419/519 Fall'20 26

(4) Independence
• When two events do not affect each other's probabilities, we call them independent.
• Formal definition:
      A ⫫ B ↔ P(A ∧ B) = P(A) × P(B) ↔ P(A | B) = P(A)

CIS 419/519 Fall'20 27

Exercise: Independence

      P(smart ∧ study ∧ prep):
                         smart              ¬smart
                   study    ¬study     study    ¬study
      prepared     0.432    0.16       0.084    0.008
      ¬prepared    0.048    0.16       0.036    0.072

• Is smart independent of study?
• Is prepared independent of study?

CIS 419/519 Fall'20 28

Exercise: Independence

      (same joint table as above)

Is smart independent of study?
      P(study ∧ smart) = 0.432 + 0.048 = 0.48
      P(study) = 0.432 + 0.048 + 0.084 + 0.036 = 0.6
      P(smart) = 0.432 + 0.048 + 0.16 + 0.16 = 0.8
      P(study) × P(smart) = 0.6 × 0.8 = 0.48
So yes!
Is prepared independent of study?
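The same check can be done numerically from the joint table; a short sketch:

    # Checking 'is smart independent of study?' from the joint above.
    # Keys are (smart, study, prepared); values are the table entries.
    joint = {(1, 1, 1): 0.432, (1, 1, 0): 0.048, (1, 0, 1): 0.160, (1, 0, 0): 0.160,
             (0, 1, 1): 0.084, (0, 1, 0): 0.036, (0, 0, 1): 0.008, (0, 0, 0): 0.072}

    p_smart = sum(p for (sm, st, pr), p in joint.items() if sm)         # 0.8
    p_study = sum(p for (sm, st, pr), p in joint.items() if st)         # 0.6
    p_both = sum(p for (sm, st, pr), p in joint.items() if sm and st)   # 0.48
    print(abs(p_both - p_smart * p_study) < 1e-9)                       # True -> independent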
CIS 419/519 Fall'20 29

Conditional Independence
• Absolute independence of A and B:
      A ⫫ B ↔ P(A ∧ B) = P(A) × P(B) ↔ P(A | B) = P(A)
• Conditional independence of A and B given C:
      A ⫫ B | C ↔ P(A ∧ B | C) = P(A | C) × P(B | C)
• This lets us decompose the joint distribution:
  – Conditional independence is different from absolute independence, but still useful in decomposing the full joint
  – P(A, B, C) = [always] P(A, B | C) P(C) = [independence] P(A | C) P(B | C) P(C)

CIS 419/519 Fall'20 30

Exercise: Independence
• Is smart independent of study?
• Is prepared independent of study?

      P(smart ∧ study ∧ prep):
                         smart              ¬smart
                   study    ¬study     study    ¬study
      prepared     0.432    0.16       0.084    0.008
      ¬prepared    0.048    0.16       0.036    0.072

CIS 419/519 Fall'20 31

Take Home Exercise: Conditional Independence
• Is smart conditionally independent of prepared, given study?
• Is study conditionally independent of prepared, given smart?
  (Use the same joint table as above.)

CIS 419/519 Fall'20 32

Summary: Basic Probability
• Product Rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
• If A and B are independent:
  – P(A, B) = P(A) P(B); P(A|B) = P(A); P(A|B, C) = P(A|C)
• Sum Rule: P(A ∨ B) = P(A) + P(B) − P(A, B)
• Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
• Total Probability:
  – If events A_1, A_2, … A_n are mutually exclusive (A_i ∩ A_j = ∅, ∑_i P(A_i) = 1):
      P(B) = ∑ P(B, A_i) = ∑_i P(B|A_i) P(A_i)
• Total Conditional Probability:
  – If events A_1, A_2, … A_n are mutually exclusive (A_i ∩ A_j = ∅, ∑_i P(A_i) = 1):
      P(B|C) = ∑ P(B, A_i | C) = ∑_i P(B|A_i, C) P(A_i|C)

CIS 419/519 Fall'20 33

Summary: The Monty Hall Problem
• Suppose you're on a game show, and you're given the choice of three doors.
  – Behind one door is a car; behind the others, goats.
  – You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat.
  – He then says to you, "Do you want to switch to door No. 2?"
• Is it to your advantage to switch your choice?
• Try to develop a rigorous argument using basic probability theory.
  – Very similar to your quiz questions.
  – Use Bayes Rule.

CIS 419/519 Fall'20 34

Bayes' Rule
• Exactly the process we just used
• The most important formula in probabilistic machine learning

      P(A | B) = P(B | A) × P(A) / P(B)

(Super easy) derivation, noting that P(A ∧ B) = P(A, B):
      P(A ∧ B) = P(A | B) × P(B)
      P(B ∧ A) = P(B | A) × P(A)
Just set the two equal:
      P(A | B) × P(B) = P(B | A) × P(A)
and solve.

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

CIS 419/519 Fall'20 38

Bayes' Rule for Machine Learning
• Allows us to reason from evidence to hypotheses.
• Another way of thinking about Bayes' rule:
      P(hypothesis | evidence) = P(evidence | hypothesis) × P(hypothesis) / P(evidence)

CIS 419/519 Fall'20 39

Basics of Bayesian Learning
• Goal: find the best hypothesis from some space H of hypotheses, given the observed data (evidence) D.
• Define best to be: the most probable hypothesis in H.
• In order to do that, we need to assume a probability distribution over the class H.
• In addition, we need to know something about the relation between the data observed and the hypotheses.
  – Think about tossing a coin: the probability of getting a Head is related to the observation.
  – As we will see, we will be Bayesian about other things too, e.g., the parameters of the model.

CIS 419/519 Fall'20 40

Basics of Bayesian Learning
• P(h) – the prior probability of a hypothesis h.
  Reflects background knowledge, before data is observed. If there is no information: a uniform distribution.
• P(D) – the probability that this sample of the data is observed (with no knowledge of the hypothesis).
• P(D|h) – the probability of observing the sample D, given that hypothesis h is the target.
• P(h|D) – the posterior probability of h: the probability that h is the target, given that D has been observed.

CIS 419/519 Fall'20 41

Bayes Theorem

      P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h)
• P(h|D) decreases with P(D)

CIS 419/519 Fall'20 42

Learning Scenario
      P(h|D) = P(D|h) P(h) / P(D)
• The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one, h ∈ H, given the observed data.
• Such a maximally probable hypothesis is called the maximum a posteriori hypothesis (MAP); Bayes theorem is used to compute it:

      h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

CIS 419/519 Fall'20 43

Learning Scenario (2)
      h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
• We may assume that a priori, hypotheses are equally probable: P(h_i) = P(h_j) ∀ h_i, h_j ∈ H.
• We get the Maximum Likelihood hypothesis:

      h_ML = argmax_{h∈H} P(D|h)

• Here we just look for the hypothesis that best explains the data.

CIS 419/519 Fall'20 44

Examples
• h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what the bias of the coin is [this is a learning problem!]
• Two hypotheses: h_1: P(H) = 0.5; h_2: P(H) = 0.6
  Prior: P(h): P(h_1) = 0.75, P(h_2) = 0.25
  Now we need data. 1st experiment: the coin toss came out H.
  – P(D|h):
      P(D|h_1) = 0.5 ; P(D|h_2) = 0.6
  – P(D):
      P(D) = P(D|h_1) P(h_1) + P(D|h_2) P(h_2) = 0.5 × 0.75 + 0.6 × 0.25 = 0.525
  – P(h|D):
      P(h_1|D) = P(D|h_1) P(h_1) / P(D) = 0.5 × 0.75 / 0.525 = 0.714
      P(h_2|D) = P(D|h_2) P(h_2) / P(D) = 0.6 × 0.25 / 0.525 = 0.286
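This posterior update is easy to verify in code; a minimal sketch (the dictionary names are illustrative):

    prior = {'h1': 0.75, 'h2': 0.25}   # h1: fair coin, h2: P(H) = 0.6
    lik_H = {'h1': 0.5, 'h2': 0.6}     # P(D|h) for the observation D = H

    p_D = sum(lik_H[h] * prior[h] for h in prior)              # 0.525
    posterior = {h: lik_H[h] * prior[h] / p_D for h in prior}
    print(posterior)                   # {'h1': 0.714..., 'h2': 0.285...}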
CIS 419/519 Fall'20 45

Examples (2)
• h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what the bias of the coin is [this is a learning problem!]
• Two hypotheses: h_1: P(H) = 0.5; h_2: P(H) = 0.6
  – Prior: P(h): P(h_1) = 0.75, P(h_2) = 0.25
• After the 1st coin toss is H we still think that the coin is more likely to be fair.
• If we were to use the Maximum Likelihood approach (i.e., assume equal priors) we would think otherwise: the data supports the biased coin better.
• Try: 100 coin tosses; 70 heads.
• You will believe that the coin is biased.

CIS 419/519 Fall'20 46

Examples (2)
      h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what the bias of the coin is [this is a learning problem!]
• Two hypotheses: h_1: P(H) = 0.5; h_2: P(H) = 0.6
  – Prior: P(h): P(h_1) = 0.75, P(h_2) = 0.25
• Case of 100 coin tosses; 70 heads:

      P(D) = P(D|h_1) P(h_1) + P(D|h_2) P(h_2)
           = 0.5^100 × 0.75 + 0.6^70 × 0.4^30 × 0.25
           = 7.9 × 10^−31 × 0.75 + 3.4 × 10^−28 × 0.25

      0.0057 = P(h_1|D) = P(D|h_1) P(h_1)/P(D)  <<  P(D|h_2) P(h_2)/P(D) = P(h_2|D) = 0.9943

In this example we had to choose one hypothesis out of two possibilities; realistically, we could have infinitely many to choose from. We will still follow the maximum likelihood principle.
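The same check in code for the 100-toss case (a sketch; the exact posteriors differ slightly from the rounded numbers on the slide, but the conclusion is the same):

    prior = {'h1': 0.75, 'h2': 0.25}
    lik = {'h1': 0.5**100,             # P(70 H, 30 T | fair coin), up to the
           'h2': 0.6**70 * 0.4**30}    # common binomial coefficient
    p_D = sum(lik[h] * prior[h] for h in prior)
    posterior = {h: lik[h] * prior[h] / p_D for h in prior}
    print(posterior)                   # h2's posterior is ~0.99: the data overwhelms the prior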
CIS 419/519 Fall'20 50

Maximum Likelihood Estimate
• Assume that you toss a (p, 1−p) coin m times and get k Heads and m−k Tails. What is p?
• If p is the probability of Head, the probability of the data observed is:
      P(D|p) = p^k (1−p)^(m−k)
• The log likelihood:
      L(p) = log P(D|p) = k log(p) + (m−k) log(1−p)
• To maximize, set the derivative w.r.t. p equal to 0:
      dL(p)/dp = k/p − (m−k)/(1−p) = 0
• Solving this for p gives: p = k/m

1. The model we assumed is binomial. You could assume a different model! Next we will consider other models and see how to learn their parameters.
2. In practice, smoothing is advisable; deriving the right smoothing can be done by assuming a prior.
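The closed-form answer p = k/m can be sanity-checked numerically; a sketch in which a grid search stands in for the calculus:

    import numpy as np

    k, m = 70, 100                         # e.g., 70 heads in 100 tosses

    def log_lik(p):
        return k * np.log(p) + (m - k) * np.log(1 - p)

    grid = np.linspace(0.001, 0.999, 999)
    print(grid[np.argmax(log_lik(grid))])  # ~0.70, i.e., k/m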
CIS 419/519 Fall'20 51

Probability Distributions
• Bernoulli Distribution:
  – Random variable X takes values {0, 1} s.t. P(X = 1) = p = 1 − P(X = 0)
  – (Think of tossing a coin)
• Binomial Distribution:
  – Random variable X takes values {0, 1, 2, …, n}, representing the number of successes (X_i = 1) in n Bernoulli trials.
  – P(X = k) = f(n, p, k) = C(n, k) p^k (1−p)^(n−k)
  – Note that if X ~ Binom(n, p) and Y_i ~ Bernoulli(p), then X = ∑_{i=1}^n Y_i
  – (Think of multiple coin tosses)

CIS 419/519 Fall'20 52

Probability Distributions (2)
• Categorical Distribution:
  – Random variable X takes on values in {1, 2, …, k} s.t. P(X = i) = p_i and ∑_1^k p_i = 1
  – (Think of a die)
• Multinomial Distribution:
  – Let the random variables X_i (i = 1, 2, …, k) indicate the number of times outcome i was observed over the n trials.
  – The vector X = (X_1, …, X_k) follows a multinomial distribution (n, p), where p = (p_1, …, p_k) and ∑_1^k p_i = 1
  – f(x_1, x_2, …, x_k; n, p) = P(X_1 = x_1, …, X_k = x_k)
        = [n!/(x_1! ⋯ x_k!)] p_1^{x_1} ⋯ p_k^{x_k}   such that ∑_{i=1}^k x_i = n
  – (Think of n tosses of a k-sided die)

CIS 419/519 Fall'20 53

An Application: A Multinomial Bag of Words

Our eventual goal will be: given a document, predict whether it's "good" or "bad".

• We are given a collection of documents written in a three-word language {a, b, c}. All the documents have exactly n words (each word can be either a, b or c).
• We are given a labeled document collection of size m, {D_1, D_2, …, D_m}. The label y_i of document D_i is 1 or 0, indicating whether D_i is "good" or "bad".
• Our generative model uses the multinomial distribution. It first decides whether to generate a good or a bad document (with P(y_i = 1) = η). Then it places words in the document; let a_i (b_i, c_i, resp.) be the number of times word a (b, c, resp.) appears in document D_i. That is, we have a_i + b_i + c_i = |D_i| = n.
• In this generative model, we have:
      P(D_i | y = 1) = [n!/(a_i! b_i! c_i!)] α_1^{a_i} β_1^{b_i} γ_1^{c_i}
  where α_1 (β_1, γ_1, resp.) is the probability that a (b, c, resp.) appears in a "good" document.
• Similarly, P(D_i | y = 0) = [n!/(a_i! b_i! c_i!)] α_0^{a_i} β_0^{b_i} γ_0^{c_i}
• Note that: α_0 + β_0 + γ_0 = α_1 + β_1 + γ_1 = 1

Unlike the discriminative case, the "game" here is different: we make an assumption on how the data is being generated (multinomial, with η, α_i, β_i, γ_i). We observe documents and estimate these parameters (that's the learning problem). Once we have the parameters, we can predict the corresponding label.

How do we model it? What is the learning problem?

CIS 419/519 Fall'20 54

A Multinomial Bag of Words (2)
• We are given a collection of documents written in a three-word language {a, b, c}. All the documents have exactly n words (each word can be either a, b or c).
• We are given a labeled document collection {D_1, D_2, …, D_m}. The label y_i of document D_i is 1 or 0, indicating whether D_i is "good" or "bad".
• The classification problem: given a document D, determine if it is good or bad; that is, determine P(y|D).
• This can be determined via Bayes rule: P(y|D) = P(D|y) P(y)/P(D)
• But we need to know the parameters of the model to compute that.

CIS 419/519 Fall'20 55

A Multinomial Bag of Words (3)
• How do we estimate the parameters?
• We derive the most likely values of the parameters defined above by maximizing the log likelihood of the observed data.
• PD = ∏_i P(y_i, D_i) = ∏_i P(D_i | y_i) P(y_i)
  – We denote by P(y_i = 1) = η the probability that an example is "good" (y_i = 1; otherwise y_i = 0). Then, for the labeled data (assuming the examples are independent):

      ∏_i P(y_i, D_i) = ∏_i [ (n!/(a_i! b_i! c_i!)) α_1^{a_i} β_1^{b_i} γ_1^{c_i} η ]^{y_i} [ (n!/(a_i! b_i! c_i!)) α_0^{a_i} β_0^{b_i} γ_0^{c_i} (1−η) ]^{1−y_i}

  Notice that this is an important trick for writing down the joint probability without knowing what the outcome of the experiment is: the i-th expression evaluates to P(D_i, y_i). (It could also be written as a sum with multiplicative y_i, but that is less convenient.) Makes sense?

• We want to maximize it with respect to each of the parameters. We first compute log(PD) and then differentiate:

      log(PD) = ∑_i y_i [log(η) + C + a_i log(α_1) + b_i log(β_1) + c_i log(γ_1)]
                    + (1−y_i) [log(1−η) + C' + a_i log(α_0) + b_i log(β_0) + c_i log(γ_0)]

      d log(PD)/dη = ∑_i [ y_i/η − (1−y_i)/(1−η) ] = 0  ⇒  ∑_i (y_i − η) = 0  ⇒  η = ∑_i y_i / m

• The same can be done for the other six parameters. However, notice that they are not independent: α_0 + β_0 + γ_0 = α_1 + β_1 + γ_1 = 1, and also a_i + b_i + c_i = |D_i| = n.
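The resulting ML estimates are simple counts. A hedged sketch (names are illustrative; the closed forms for α, β, γ follow from the same constrained maximization that gave η):

    # docs: list of (a_i, b_i, c_i) word-count triples; ys: list of 0/1 labels
    def estimate_params(docs, ys):
        m = len(ys)
        eta = sum(ys) / m                      # eta = sum_i y_i / m
        def word_probs(label):
            rows = [d for d, y in zip(docs, ys) if y == label]
            total = sum(sum(d) for d in rows)  # all word occurrences with this label
            return tuple(sum(d[j] for d in rows) / total for j in range(3))
        return eta, word_probs(1), word_probs(0)  # eta, (a1,b1,g1), (a0,b0,g0)

    print(estimate_params([(3, 1, 0), (1, 1, 2), (0, 2, 2)], [1, 1, 0]))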
CIS 419/519 Fall'20 56

Administration (11/30/20)
• Remember that all the lectures are available on the website before the class
  – Go over them and be prepared
  – A new set of written notes will accompany most lectures, with some more details, examples, and (when relevant) some code.
• HW4 is out – NNs and Bayesian Learning
  – Due 12/7 (next Monday; moved from 12/3)
• Next week is the last week of classes; there will be three meetings (Monday, Wednesday, Thursday).
• Projects
  – Most of you have chosen a project and a team.
  – You should have received an email from one of the TAs, who will help guide your project.
• The Final is on 12/18; it is scheduled for 9am; we will try to give you a larger window of time.
• Project Paper and Video Submission: 12/21

Are we recording? YES! Available on the website.

CIS 419/519 Fall'20 57

Projects
• CIS 519 students need to do a team project: read the project descriptions and follow the updates on the Project webpage.
  – Teams will be of size 2-4
  – We will help with grouping if needed
• There will be 3 options for projects:
  – Natural Language Processing (Text)
  – Computer Vision (Images)
  – Speech (Audio)
• In all cases, we will give you datasets and initial ideas
  – The problems will be multiclass classification problems
  – You will get annotated data only for some of the labels, but will also have to predict other labels
  – Zero-shot learning; few-shot learning; transfer learning
• A detailed note will come out today.
• Timeline:
  – 11/11 Choose a project and team up
  – 11/23 Initial proposal describing what your team plans to do
  – 12/2 Progress report: 1 page. What you have done; plans; problems.
  – 12/21 Final paper + short video
• Try to make it interesting!

CIS 419/519 Fall'20 58

Bayesian Learning

CIS 419/519 Fall'20 59

Generating Sequences: HMMs
• Consider the following process for generating a sentence:
• First choose a part-of-speech tag (noun, verb, adjective, determiner, etc.)
  – Given that, choose a word.
  – Next choose the next part of speech, and given it, a word.
  – Keep going until you choose a punctuation mark as a POS tag, and a period as the word.

[Figure: the sentence "The book is interesting ." generated from the tag sequence DET N V Adj ., with initial, transition, and emission probabilities (1, 0.2, 0.5, …, 0.4, 0.25, …) on the edges]

• What are the probabilities you need to have in order to generate this sentence?
  – Transition: P(pos_t | pos_{t−1})   (#(pos) × #(pos) parameters)
  – Emission: P(w | pos)               (#(w) × #(pos) parameters)
  – Initial: p_0(pos)                  (#(pos) parameters)
• Given data, you can estimate these probabilities.
• Once you have them, when observing a sentence you can predict its POS-tag sequence.

CIS 419/519 Fall'20 60

Generating Sequences: HMMs
• Consider data over 5 characters, x ∈ {a, b, c, d, e}, and 3 states s ∈ {B, I, O}
  – Think about chunking a sentence into phrases: B is the Beginning of each phrase, I is Inside a phrase, O is Outside.
  – (Or: think about being in one of two rooms, observing x in it, and moving to the other.)
• We generate characters according to:
• Initial state probabilities: P(B) = 1; P(I) = P(O) = 0
• State transition probabilities:
  – P(B→B) = 0.8, P(B→I) = 0.2
  – P(I→B) = 0.5, P(I→I) = 0.5
• Output probabilities:
  – P(a|B) = 0.25, P(b|B) = 0.10, P(c|B) = 0.10, …
  – P(a|I) = 0.25, P(b|I) = 0, …
• We can follow the generation process to get the observed sequence.

[Figure: a two-state automaton over B and I with the transition probabilities above, shown emitting the character sequence a d d c a with state sequence B I I I B]

We can do the same exercise we did before.
• Data: {(x_1, x_2, … x_m; s_1, s_2, … s_m)}_1^n
• Find the most likely parameters of the model:
      P(x_i | s_i), P(s_{i+1} | s_i), P(s_1)
• Given an unlabeled example (observation) x = (x_1, x_2, … x_m), use Bayes rule to predict the label s = (s_1, s_2, … s_m):
      s* = argmax_s P(s|x) = argmax_s P(x|s) P(s)/P(x)
• The only issue is computational: there are 2^m possible values of s (labels).
• This is an HMM model (but nothing was hidden; the s_i were given in training). It's also possible to solve it when the s_1, s_2, … s_m are hidden, via EM.
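To make the generative story concrete, here is a small sampling sketch; the emission probabilities the slide leaves as "…" are made-up values, chosen only so each row sums to 1:

    import random

    init = {'B': 1.0, 'I': 0.0}
    trans = {'B': {'B': 0.8, 'I': 0.2},
             'I': {'B': 0.5, 'I': 0.5}}
    emit = {'B': {'a': 0.25, 'b': 0.10, 'c': 0.10, 'd': 0.30, 'e': 0.25},  # d, e assumed
            'I': {'a': 0.25, 'b': 0.00, 'c': 0.25, 'd': 0.25, 'e': 0.25}}  # c, d, e assumed

    def generate(n):
        state = random.choices(list(init), weights=list(init.values()))[0]
        chars = []
        for _ in range(n):
            chars.append(random.choices(list(emit[state]),
                                        weights=list(emit[state].values()))[0])
            state = random.choices(list(trans[state]),
                                   weights=list(trans[state].values()))[0]
        return ''.join(chars)

    print(generate(10))  # e.g., something like 'addcaadbca'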
CIS 419/519 Fall'20 61

Bayes Optimal Classifier
• How should we use the general formalism?
• What should H be?
• H can be a collection of functions: given the training data, choose an optimal function; then, given new data, evaluate the selected function on it.
• H can be a collection of possible predictions: given the data, try to directly choose the optimal prediction.
• The two could be different!

CIS 419/519 Fall'20 62

Bayes Optimal Classifier
• The first formalism suggests learning a good hypothesis and using it.
  (Language modeling, grammar learning, etc. are here.)

      h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• The second one suggests directly choosing a decision [it/in].
• This is the issue of "thresholding" vs. entertaining all options until the last minute. (Computational issues)

CIS 419/519 Fall'20 63

Bayes Optimal Classifier: Example
• Assume a space of 3 hypotheses:
  – P(h_1|D) = 0.4; P(h_2|D) = 0.3; P(h_3|D) = 0.3  →  h_MAP = h_1
• Given a new instance x, assume that
  – h_1(x) = 1, h_2(x) = 0, h_3(x) = 0
• In this case,
  – P(f(x) = 1) = 0.4; P(f(x) = 0) = 0.6, but h_MAP(x) = 1
• We want to determine the most probable classification by combining the predictions of all hypotheses, weighted by their posterior probabilities.
• Think about it as deferring your commitment: don't commit to the most likely hypothesis until you have to make a decision. (This has computational consequences.)

CIS 419/519 Fall'20 64

Bayes Optimal Classifier: Example (2)
• Let V be a set of possible classifications.

      P(v_j | D) = ∑_{h_i ∈ H} P(v_j | h_i, D) P(h_i | D) = ∑_{h_i ∈ H} P(v_j | h_i) P(h_i | D)

• Bayes Optimal Classification:

      v = argmax_{v_j ∈ V} P(v_j | D) = argmax_{v_j ∈ V} ∑_{h_i ∈ H} P(v_j | h_i) P(h_i | D)

• In the example:

      P(1 | D) = ∑_{h_i ∈ H} P(1 | h_i) P(h_i | D) = 1 · 0.4 + 0 · 0.3 + 0 · 0.3 = 0.4
      P(0 | D) = ∑_{h_i ∈ H} P(0 | h_i) P(h_i | D) = 0 · 0.4 + 1 · 0.3 + 1 · 0.3 = 0.6

• and the optimal prediction is indeed 0.
• The key example of using a "Bayes optimal classifier" is that of the Naïve Bayes algorithm.
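The weighted vote is a three-line computation; a sketch of the example above:

    posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}   # P(h|D)
    prediction = {'h1': 1, 'h2': 0, 'h3': 0}        # h(x)

    p1 = sum(p for h, p in posterior.items() if prediction[h] == 1)  # 0.4
    p0 = sum(p for h, p in posterior.items() if prediction[h] == 0)  # 0.6
    print(1 if p1 > p0 else 0)  # 0: the Bayes optimal prediction, not h_MAP's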
CIS 419/519 Fall'20 65

Bayesian Classifier
• f: X → V, a finite set of values
• Instances x ∈ X can be described as a collection of features:
      x = (x_1, x_2, … x_n), x_i ∈ {0, 1}
• Given an example, assign it the most probable value in V.
• Bayes Rule:
      v_MAP = argmax_{v_j ∈ V} P(v_j | x) = argmax_{v_j ∈ V} P(v_j | x_1, x_2, … x_n)
            = argmax_{v_j ∈ V} P(x_1, x_2, … x_n | v_j) P(v_j) / P(x_1, x_2, … x_n)
            = argmax_{v_j ∈ V} P(x_1, x_2, … x_n | v_j) P(v_j)
• Notational convention: P(y) means P(Y = y).

CIS 419/519 Fall'20 66

Bayesian Classifier
      v_MAP = argmax_v P(x_1, x_2, … x_n | v) P(v)
• Given training data we can estimate the two terms.
• Estimating P(v) is easy. E.g., under the binomial distribution assumption, count the number of times v appears in the training data.
• However, it is not feasible to estimate P(x_1, x_2, … x_n | v).
  – In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur).
• In order to use Bayesian classifiers in practice, we need to make assumptions that will allow us to estimate these quantities.

CIS 419/519 Fall'20 67

Naive Bayes
      v_MAP = argmax_v P(x_1, x_2, … x_n | v) P(v)

      P(x_1, x_2, … x_n | v_j) = P(x_1 | x_2, …, x_n, v_j) P(x_2, …, x_n | v_j)
        = P(x_1 | x_2, …, x_n, v_j) P(x_2 | x_3, …, x_n, v_j) P(x_3, …, x_n | v_j)
        = ⋯
        = P(x_1 | x_2, …, x_n, v_j) P(x_2 | x_3, …, x_n, v_j) ⋯ P(x_n | v_j)
        = ∏_{i=1}^{n} P(x_i | v_j)

• Assumption: feature values are independent given the target value.

CIS 419/519 Fall'20 68

Naive Bayes (2)
      v_MAP = argmax_v P(x_1, x_2, … x_n | v) P(v)
• Assumption: feature values are independent given the target value

      P(x_1 = b_1, x_2 = b_2, …, x_n = b_n | v = v_j) = ∏_{i=1}^{n} P(x_i = b_i | v = v_j)

• Generative model:
  – First choose a label value v_j ∈ V according to P(v)
  – For each v_j: choose x_1, x_2, …, x_n according to P(x_k | v_j)

CIS 419/519 Fall'20 69

Naive Bayes (3)
      v_MAP = argmax_v P(x_1, x_2, … x_n | v) P(v)
• Assumption: feature values are independent given the target value

      P(x_1 = b_1, x_2 = b_2, …, x_n = b_n | v = v_j) = ∏_{i=1}^{n} P(x_i = b_i | v = v_j)

• Learning method: estimate n|V| + |V| parameters and use them to make a prediction. (How do we estimate them?)
• Notice that this is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
• This is learning without trying to achieve consistency or even approximate consistency.
• Why does it work?

CIS 419/519 Fall'20 70

Conditional Independence (1)
• Notice that the feature values are conditionally independent given the target value, and are not required to be independent.
• Example: the Boolean features are x and y. We define the label to be l = f(x, y) = x ∧ y over the product distribution:
      P(x = 0) = P(x = 1) = 1/2 and P(y = 0) = P(y = 1) = 1/2
  The distribution is defined so that x and y are independent: P(x, y) = P(x) P(y). That is:

                      x = 0         x = 1
      y = 0        1/4 (l = 0)   1/4 (l = 0)
      y = 1        1/4 (l = 0)   1/4 (l = 1)

• But, given that l = 0:
      P(x = 1 | l = 0) = P(y = 1 | l = 0) = 1/3
  while: P(x = 1, y = 1 | l = 0) = 0
  so x and y are not conditionally independent.

CIS 419/519 Fall'20 71

Conditional Independence (2)
• The other direction also does not hold: x and y can be conditionally independent but not independent.
• Example: we define a distribution s.t.:
      l = 0: P(x = 1 | l = 0) = 1, P(y = 1 | l = 0) = 0
      l = 1: P(x = 1 | l = 1) = 0, P(y = 1 | l = 1) = 1
  and assume that P(l = 0) = P(l = 1) = 1/2:

                      x = 0         x = 1
      y = 0         0 (l = 0)    1/2 (l = 0)
      y = 1        1/2 (l = 1)    0 (l = 1)

• Given the value of l, x and y are independent (check!)
• What about unconditional independence?
      P(x = 1) = P(x = 1 | l = 0) P(l = 0) + P(x = 1 | l = 1) P(l = 1) = 0.5 + 0 = 0.5
      P(y = 1) = P(y = 1 | l = 0) P(l = 0) + P(y = 1 | l = 1) P(l = 1) = 0 + 0.5 = 0.5
  But:
      P(x = 1, y = 1) = P(x = 1, y = 1 | l = 0) P(l = 0) + P(x = 1, y = 1 | l = 1) P(l = 1) = 0
  so x and y are not independent.

CIS 419/519 Fall'20 72

Naïve Bayes Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

      Day  Outlook   Temperature  Humidity  Wind    PlayTennis
      1    Sunny     Hot          High      Weak    No
      2    Sunny     Hot          High      Strong  No
      3    Overcast  Hot          High      Weak    Yes
      4    Rain      Mild         High      Weak    Yes
      5    Rain      Cool         Normal    Weak    Yes
      6    Rain      Cool         Normal    Strong  No
      7    Overcast  Cool         Normal    Strong  Yes
      8    Sunny     Mild         High      Weak    No
      9    Sunny     Cool         Normal    Weak    Yes
      10   Rain      Mild         Normal    Weak    Yes
      11   Sunny     Mild         Normal    Strong  Yes
      12   Overcast  Mild         High      Strong  Yes
      13   Overcast  Hot          Normal    Weak    Yes
      14   Rain      Mild         High      Strong  No

CIS 419/519 Fall'20 73

Estimating Probabilities
      v_NB = argmax_{v ∈ {yes, no}} P(v) ∏_i P(x_i = observation | v)
• How do we estimate P(observation | v)?

CIS 419/519 Fall'20 74-75

Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)    (6 numbers)
• Compute P(temp = h/mild/cool | PlayTennis = yes/no)  (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)   (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)         (4 numbers)

CIS 419/519 Fall'20 77

Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Compute P(PlayTennis = yes); P(PlayTennis = no)
• Compute P(outlook = s/oc/r | PlayTennis = yes/no)    (6 numbers)
• Compute P(temp = h/mild/cool | PlayTennis = yes/no)  (6 numbers)
• Compute P(humidity = hi/nor | PlayTennis = yes/no)   (4 numbers)
• Compute P(wind = w/st | PlayTennis = yes/no)         (4 numbers)
• (1, 4, 4, 2, 2 independent numbers)
• Given a new instance:
      (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• Predict: PlayTennis = ?

CIS 419/519 Fall'20 78

Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• P(PlayTennis = yes) = 9/14 = 0.64        P(PlayTennis = no) = 5/14 = 0.36
• P(outlook = sunny | yes) = 2/9           P(outlook = sunny | no) = 3/5
• P(temp = cool | yes) = 3/9               P(temp = cool | no) = 1/5
• P(humidity = hi | yes) = 3/9             P(humidity = hi | no) = 4/5
• P(wind = strong | yes) = 3/9             P(wind = strong | no) = 3/5
• P(yes, …) ~ 0.0053                       P(no, …) ~ 0.0206

CIS 419/519 Fall'20 79

Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)
• Using the probabilities above:
      P(yes, …) ~ 0.0053        P(no, …) ~ 0.0206
• Normalizing:
      P(no | instance) = 0.0206/(0.0053 + 0.0206) = 0.795

What if we were asked about Outlook = OC (overcast)?
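Plugging the estimated probabilities into the Naïve Bayes rule, as a quick check:

    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9  # P(yes) P(sunny|yes) P(cool|yes) P(hi|yes) P(strong|yes)
    p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5
    print(p_yes, p_no)                    # ~0.0053 vs ~0.0206
    print(p_no / (p_yes + p_no))          # ~0.795 -> predict PlayTennis = no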
CIS 419/519 Fall'20 80

Naïve Bayes Example
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
  (the same 14-day PlayTennis table as above, repeated for reference)

CIS 419/519 Fall'20 81

Estimating Probabilities
      v_NB = argmax_v P(v) ∏_i P(x_i | v)
• How do we estimate P(x_i | v)?
• As we suggested before, we made a binomial assumption; then:

      P(x_i = 1 | v) = #(x_i = 1 in training samples labeled v) / #(samples labeled v) = n_i / n

• Sparsity of data is a problem:
  – if n is small, the estimate is not accurate
  – if n_i is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data

CIS 419/519 Fall'20 82

Robust Estimation of Probabilities
      v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(x_i = word_i | v)
• This process is called smoothing.
• There are many ways to do it, some better justified than others.
• It is an empirical issue.

Here:
      P(x_i | v) = (n_i + m p) / (n + m)
• n_i is the # of occurrences of the i-th feature given the label v
• n is the # of occurrences of the label v
• p is a prior estimate of v (e.g., uniform)
• m is the equivalent sample size (# of labels)
  – Is this a reasonable definition?

CIS 419/519 Fall'20 83

Robust Estimation of Probabilities
• Smoothing:
      P(x_i | v) = (n_i + m p) / (n + m)
• Common values:
  – Laplace Rule: for the Boolean case, p = 1/2 and m = 2, giving
      P(x_i | v) = (n_i + 1) / (n + 2)
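This also answers the Outlook = overcast question above: the raw estimate P(overcast | no) = 0/5 would zero out the whole product, while the smoothed one does not. A sketch:

    def smoothed(n_i, n, p=0.5, m=2):
        # P(x_i | v) = (n_i + m*p) / (n + m); the Laplace rule when p = 1/2, m = 2
        return (n_i + m * p) / (n + m)

    # Outlook = overcast never occurs among the 5 PlayTennis = no examples:
    print(smoothed(0, 5))  # 1/7 instead of 0, so 'no' is no longer ruled out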
CIS 419/519 Fall'20 85

Another Comment on Estimation
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• In practice, we will typically work in the log space:
      v_NB = argmax_j [ log P(v_j) + ∑_i log P(x_i | v_j) ]

CIS 419/519 Fall'20 86

Naïve Bayes: Two Classes
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

      [ P(v_j = 1) ∏_{i=1}^{n} P(x_i | v_j = 1) ] / [ P(v_j = 0) ∏_{i=1}^{n} P(x_i | v_j = 0) ] > 1

Why do things work?

CIS 419/519 Fall'20 87

Naïve Bayes: Two Classes
      v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)
• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
• In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

      [ P(v_j = 1) ∏_{i=1}^{n} P(x_i | v_j = 1) ] / [ P(v_j = 0) ∏_{i=1}^{n} P(x_i | v_j = 0) ] > 1

• Denote p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0). Then:

      [ P(v_j = 1) ∏_{i=1}^{n} p_i^{x_i} (1 − p_i)^{1−x_i} ] / [ P(v_j = 0) ∏_{i=1}^{n} q_i^{x_i} (1 − q_i)^{1−x_i} ] > 1

CIS 419/519 Fall'20 88

Naïve Bayes: Two Classes
• In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

      [ P(v_j = 1) ∏_{i=1}^{n} p_i^{x_i} (1 − p_i)^{1−x_i} ] / [ P(v_j = 0) ∏_{i=1}^{n} q_i^{x_i} (1 − q_i)^{1−x_i} ]
    = [ P(v_j = 1) ∏_{i=1}^{n} (1 − p_i) (p_i/(1 − p_i))^{x_i} ] / [ P(v_j = 0) ∏_{i=1}^{n} (1 − q_i) (q_i/(1 − q_i))^{x_i} ] > 1

CIS 419/519 Fall'20 89

Naïve Bayes: Two Classes
• In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

      [ P(v_j = 1) ∏_{i=1}^{n} (1 − p_i) (p_i/(1 − p_i))^{x_i} ] / [ P(v_j = 0) ∏_{i=1}^{n} (1 − q_i) (q_i/(1 − q_i))^{x_i} ] > 1

• Take the logarithm; we predict v = 1 iff:

      log[P(v_j = 1)/P(v_j = 0)] + ∑_i log[(1 − p_i)/(1 − q_i)] + ∑_i ( log[p_i/(1 − p_i)] − log[q_i/(1 − q_i)] ) x_i > 0

• What is the decision function we derived?

CIS 419/519 Fall'20 90

Naïve Bayes: Two Classes
• Continuing: we predict v = 1 iff

      log[P(v_j = 1)/P(v_j = 0)] + ∑_i log[(1 − p_i)/(1 − q_i)] + ∑_i ( log[p_i/(1 − p_i)] − log[q_i/(1 − q_i)] ) x_i > 0

• We get that naïve Bayes is a linear separator, with weights

      w_i = log[p_i/(1 − p_i)] − log[q_i/(1 − q_i)] = log[ p_i (1 − q_i) / (q_i (1 − p_i)) ]

• If p_i = q_i then w_i = 0 and the feature is irrelevant.

CIS 419/519 Fall'20 91

Naïve Bayes: Two Classes
• In the case of two classes we have that:
      log[ P(v_j = 1 | x) / P(v_j = 0 | x) ] = ∑_i w_i x_i − b
• but since
      P(v_j = 1 | x) = 1 − P(v_j = 0 | x)
• we get:
      P(v_j = 1 | x) = 1 / (1 + exp(−∑_i w_i x_i + b))
• which is simply the logistic function.
• The linearity of NB provides a better explanation for why it works.

(To see this, write A = P(v_j = 1|x), B = P(v_j = 0|x), and z = ∑_i w_i x_i − b, so that A = 1 − B and log(B/A) = −z. Then exp(−z) = B/A = (1 − A)/A = 1/A − 1, so 1 + exp(−z) = 1/A and A = 1/(1 + exp(−z)).)

CIS 419/519 Fall'20 92

Why Does it Work?
• Learning Theory
  – Probabilistic predictions are done via linear models
  – The low expressivity explains generalization + robustness
• Expressivity (1: Methodology)
  – Is there a reason to believe that these hypotheses minimize the empirical error
        Err_S(h) = |{x ∈ S : h(x) ≠ y}| / |S| ?
  – In general, no. (Unless some probabilistic assumptions happen to hold.)
  – But: if the hypothesis does not fit the training data,
    • experimenters will augment the set of features (in fact, add dependencies)
  – But now, this actually follows the learning theory protocol:
    • Try to learn a hypothesis that is consistent with the data
    • Generalization will be a function of the low expressivity
• Expressivity (2: Theory)
  – (a) Product distributions are "dense" in the space of all distributions. Consequently, the resulting predictor's error is actually close to that of the optimal classifier.
  – (b) NB classifiers cannot express all linear functions; but they converge faster to what they can express.

CIS 419/519 Fall'20 93

What's Next?
(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  – Yes; shown earlier.
(2) If probabilistic hypotheses are actually like other linear functions, can you train them similarly (that is, discriminatively)?
  – Yes.
  – Classification: Logistic Regression / Max Entropy
  – HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models class)

CIS 419/519 Fall'20 94

Recall: Naïve Bayes, Two Classes
• In the case of two classes we have:
      log[ P(v_j = 1 | x) / P(v_j = 0 | x) ] = ∑_i w_i x_i − b
• but since
      P(v_j = 1 | x) = 1 − P(v_j = 0 | x)
• we showed that:
      P(v_j = 1 | x) = 1 / (1 + exp(−∑_i w_i x_i + b))
• which is simply the logistic (sigmoid) function used in the neural network representation.

CIS 419/519 Fall'20 95

What's Next (1)
• (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically?
  – Yes.
• General recipe:
  – Train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc.). Then:
  – Use the sigmoid 1/(1 + exp{−(A w^T x + B)}) to get an estimate for P(y | x).
  – A, B can be tuned using a held-out set that was not used for training.
  – Done in LBJava, for example.
CIS 419/519 Fall'20 96

What's Next (2): Logistic Regression
• (2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)? How?
• The logistic regression model assumes the following model:
      P(y = +1 | x, w) = [1 + exp(−y(w^T x + b))]^{−1}
• This is the same model we derived for naïve Bayes, only that now we will not make any independence assumption. We will directly find the best w.
• Therefore training will be more difficult. However, the weight vector derived will be more expressive.
  – It can be shown that the naïve Bayes algorithm cannot represent all linear threshold functions.
  – On the other hand, NB converges to its performance faster.

CIS 419/519 Fall'20 97

Logistic Regression (2)
• Given the model:
      P(y = +1 | x, w) = [1 + exp(−y(w^T x + b))]^{−1}
• The goal is to find the (w, b) that maximizes the log likelihood of the data {x_1, x_2, … x_m}.
• We are looking for the (w, b) that minimizes the negative log-likelihood:

      min_{w,b} ∑_1^m −log P(y = +1 | x, w) = min_{w,b} ∑_1^m log[1 + exp(−y_i(w^T x_i + b))]

• This optimization problem is called Logistic Regression.
• Logistic Regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those that activate the same features).

CIS 419/519 Fall'20 98

Logistic Regression (3)
• Using the standard mapping to linear separators through the origin, we would like to minimize:

      min_w ∑_1^m −log P(y = +1 | x, w) = min_w ∑_1^m log[1 + exp(−y_i w^T x_i)]

• To get good generalization, it is common to add a regularization term, and the regularized logistic regression then becomes:

      min_w f(w) = ½ w^T w + C ∑_1^m log[1 + exp(−y_i w^T x_i)]
                   (regularization term)   (empirical loss)

  where C is a user-selected parameter that balances the two terms.
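A minimal gradient-descent sketch of this regularized objective (assuming labels y_i ∈ {−1, +1}; the learning rate and epoch count are arbitrary choices):

    import numpy as np

    def train_logreg(X, y, C=1.0, lr=0.01, epochs=500):
        # minimizes 1/2 w'w + C * sum_i log(1 + exp(-y_i w'x_i))
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            margins = y * (X @ w)
            # d/dw log(1 + exp(-m_i)) = -y_i x_i / (1 + exp(m_i))
            grad = w + C * (X.T @ (-y / (1.0 + np.exp(margins))))
            w -= lr * grad
        return w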
CIS 419/519 Fall'20 99

Comments on Discriminative Learning
      min_w f(w) = ½ w^T w + C ∑_1^m log[1 + exp(−y_i w^T x_i)]
  where C is a user-selected parameter that balances the two terms (the regularization term and the empirical loss).
• Since the second term is a loss function, regularized logistic regression can be related to other learning methods, e.g., SVMs.
• L1 SVM solves the following optimization problem:
      min_w f_1(w) = ½ w^T w + C ∑_1^m max(0, 1 − y_i w^T x_i)
• L2 SVM solves the following optimization problem:
      min_w f_2(w) = ½ w^T w + C ∑_1^m max(0, 1 − y_i w^T x_i)^2

CIS 419/519 Fall'20 100

A Few More NB Examples

CIS 419/519 Fall'20 101

Example: Learning to Classify Text
      v_NB = argmax_{v ∈ V} P(v) ∏_i P(x_i | v)
• Instance space X: text documents
• Instances are labeled according to f(x) = like/dislike
• Goal: learn this function such that, given a new document, you can use it to decide if you like it or not
• How do we represent the document?
• How do we estimate the probabilities?
• How do we classify?

CIS 419/519 Fall'20 102

Document Representation
• Instance space X: text documents
• Instances are labeled according to y = f(x) = like/dislike
• How do we represent the document?
• A document will be represented as a list of its words.
• The representation question can be viewed as the generation question.
• We have a dictionary of n words (therefore 2n parameters).
• We have documents of size N: we could account for word position & count.
• Having a parameter for each word & position may be too much:
  – # of parameters: 2 × N × n (2 × 100 × 50,000 ~ 10^7)
• Simplifying assumption:
  – The probability of observing a word in a document is independent of its location.
  – This still allows us to think about two ways of generating the document.

CIS 419/519 Fall'20 103

Classification via Bayes Rule (B)
• We want to compute
      argmax_y P(y|D) = argmax_y P(D|y) P(y)/P(D) = argmax_y P(D|y) P(y)
• Our assumptions will go into estimating P(D|y):
1. Multivariate Bernoulli
   I. To generate a document, first decide if it's good (y = 1) or bad (y = 0).
   II. Given that, consider your dictionary of words and choose each word w into your document with probability P(w|y), irrespective of anything else.
   III. If the size of the dictionary is |V| = n, we can then write
        P(d|y) = ∏_{i=1}^{n} P(w_i = 1|y)^{b_i} P(w_i = 0|y)^{1−b_i}
• Where:
  – P(w = 1/0 | y) is the probability that w appears/does-not-appear in a y-labeled document
  – b_i ∈ {0, 1} indicates whether word w_i occurs in document d
• 2n + 2 parameters:
  1. Priors: P(y = 0/1)
  2. ∀ w_i ∈ Dictionary: P(w_i = 0/1 | y = 0/1)
• Estimating P(w_i = 1|y) and P(y) is done in the ML way as before (counting).

CIS 419/519 Fall'20 104

A Multinomial Model
• We want to compute
      argmax_y P(y|D) = argmax_y P(D|y) P(y)/P(D) = argmax_y P(D|y) P(y)
• Our assumptions will go into estimating P(D|y):
2. Multinomial
   I. To generate a document, first decide if it's good (y = 1) or bad (y = 0).
   II. Given that, place N words into d, such that w_i is placed with probability P(w_i|y), and ∑_i^n P(w_i|y) = 1.
   III. The probability of a document is:
        P(d|y) = [N!/(n_1! ⋯ n_k!)] P(w_1|y)^{n_1} ⋯ P(w_k|y)^{n_k}
• Where n_i is the # of times w_i appears in the document.
• Parameters:
  1. Priors: P(y = 0/1)
  2. ∀ w_i ∈ Dictionary: P(w_i | y = 0/1), where N dictionary items are chosen into D
• Same # of parameters: 2n + 2, where n = |Dictionary|, but the estimation is done a bit differently. (HW)

CIS 419/519 Fall'20 105

Model Representation
• The generative model in these two cases is different:

[Plate diagrams: in both models a prior µ governs the document's topic/label, and per-word parameters β govern word generation.]

• Bernoulli: a binary variable corresponds to a document d and a dictionary word w, and takes the value 1 if w appears in d. The document's topic/label is governed by the prior µ, and the variable in the intersection of the plates (Documents × Words) is governed by µ and the Bernoulli parameter β for the dictionary word w.
• Multinomial: words do not correspond to dictionary words but to positions (occurrences) in the document d. The internal variable is then W(D, p). These variables are generated from the same multinomial distribution β, and depend on the topic/label.

CIS 419/519 Fall'20 106

General NB Scenario
• We assume a mixture probability model, parameterized by µ.
• Different components {c_1, c_2, … c_k} of the model are parameterized by disjoint subsets of µ.
• The generative story: a document d is created by (1) selecting a component according to the priors, P(c_j | µ), then (2) having the mixture component generate a document according to its own parameters, with distribution P(d | c_j, µ).
• So we have:
      P(d | µ) = ∑_1^k P(c_j | µ) P(d | c_j, µ)
• In the case of document classification, we assume a one-to-one correspondence between components and labels.

CIS 419/519 Fall'20 107-109

Naïve Bayes: Continuous Features
• X_i can be continuous.
• We can still use
      P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)
• and
      P(Y = y | X_1, …, X_n) = P(Y = y) ∏_i P(X_i | Y = y) / ∑_j [ P(Y = y_j) ∏_i P(X_i | Y = y_j) ]
• Naïve Bayes classifier:
      Y = argmax_y P(Y = y) ∏_i P(X_i | Y = y)
• Assumption: P(X_i | Y) has a Gaussian distribution.

CIS 419/519 Fall'20 110

The Gaussian Probability Distribution
• The Gaussian probability distribution is also called the normal distribution.
• It is a continuous distribution with pdf:
      p(x) = [1/(σ√(2π))] e^{−(x−µ)²/(2σ²)}
• µ = the mean of the distribution
• σ² = the variance of the distribution
• x is a continuous variable (−∞ ≤ x ≤ ∞)
• The probability of x being in the range [a, b] cannot be evaluated analytically (it has to be looked up in a table).

[Figure: the Gaussian bell curve]

CIS 419/519 Fall'20 111

Naïve Bayes: Continuous Features
• P(X_i | Y) is Gaussian.
• Training: estimate the mean and standard deviation:
  – µ_i = E[X_i | Y = y]
  – σ_i² = E[(X_i − µ_i)² | Y = y]

Note that the following slides abuse notation significantly. Since p(x) = 0 for continuous distributions, we think of P(X = x | Y = y) not as a classic probability distribution, but just as a function f(x) = N(x; µ, σ²). f(x) behaves as a probability distribution in the sense that ∀x, f(x) ≥ 0 and the values integrate to 1. Also note that f(x) satisfies Bayes Rule; that is, it is true that:
      f_Y(y | X = x) = f_X(x | Y = y) f_Y(y) / f_X(x)

CIS 419/519 Fall'20 112-113

Naïve Bayes: Continuous Features
• P(X_i | Y) is Gaussian.
• Training: estimate the mean and standard deviation:
  – µ_i = E[X_i | Y = y]
  – σ_i² = E[(X_i − µ_i)² | Y = y]

      X_1    X_2   X_3   Y
      2      3     1     1
      −1.2   2     0.4   1
      1.2    0.3   0     0
      2.2    1.1   0     1

• For example, for Y = 1:
  – µ_1 = E[X_1 | Y = 1] = (2 + (−1.2) + 2.2)/3 = 1
  – σ_1² = E[(X_1 − µ_1)² | Y = 1] = ((2 − 1)² + (−1.2 − 1)² + (2.2 − 1)²)/3 = 2.43
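The same estimates in a few lines of numpy, for checking:

    import numpy as np

    X = np.array([[2, 3, 1], [-1.2, 2, 0.4], [1.2, 0.3, 0], [2.2, 1.1, 0]])
    Y = np.array([1, 1, 0, 1])

    X1 = X[Y == 1]                         # rows with label Y = 1
    mu1 = X1[:, 0].mean()                  # (2 - 1.2 + 2.2)/3 = 1
    var1 = ((X1[:, 0] - mu1) ** 2).mean()  # = 2.43 (the ML estimate, as above)
    print(mu1, var1)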
CIS 419/519 Fall'20 114

Recall: Naïve Bayes, Two Classes
• In the case of two classes we have that:
      log[ P(v = 1 | x) / P(v = 0 | x) ] = ∑_i w_i x_i − b
• but since
      P(v = 1 | x) = 1 − P(v = 0 | x)
• we get:
      P(v = 1 | x) = 1 / (1 + exp(−∑_i w_i x_i + b))
• which is simply the logistic function (also used in the neural network representation).
• The same formula can be written for continuous features.

CIS 419/519 Fall'20 115

Logistic Function: Continuous Features
• The logistic function for Gaussian features. (Note that we are using a ratio of probability densities, since x is a continuous variable.)

      P(v = 1 | x) = 1 / (1 + exp( log[ P(v = 0 | x)/P(v = 1 | x) ] ))
                   = 1 / (1 + exp( log[ P(v = 0) P(x | v = 0) / (P(v = 1) P(x | v = 1)) ] ))
                   = 1 / (1 + exp( log[P(v = 0)/P(v = 1)] + ∑_i log[ P(x_i | v = 0)/P(x_i | v = 1) ] ))

  where, with Gaussian class-conditional densities (shared variance σ_i² per feature):

      ∑_i log[ P(x_i | v = 0)/P(x_i | v = 1) ]
        = ∑_i log [ (1/√(2πσ_i²)) exp(−(x_i − µ_i0)²/(2σ_i²)) / ( (1/√(2πσ_i²)) exp(−(x_i − µ_i1)²/(2σ_i²)) ) ]
        = ∑_i log exp( [ (x_i − µ_i1)² − (x_i − µ_i0)² ] / (2σ_i²) )
        = ∑_i ( [ (µ_i0 − µ_i1)/σ_i² ] x_i + (µ_i1² − µ_i0²)/(2σ_i²) )

  so the exponent is again a linear function of x.