Page 1:

Machine Learning (CS 567)

Fall 2008

Time: T-Th 5:00pm - 6:20pm
Location: GFS 118

Instructor: Sofus A. Macskassy ([email protected])
Office: SAL 216
Office hours: by appointment

Teaching assistant: Cheol Han ([email protected])
Office: SAL 229
Office hours: M 2-3pm, W 11-12

Class web page: http://www-scf.usc.edu/~csci567/index.html

Page 2:

Lecture 11 Outline

• Bayesian Learning

– Probability theory

– Bayesian Classification

Page 3:

So far: Discriminative Learning

• We want to distinguish between different classes based on examples of each class.

• Linear classifiers and decision trees generate separating planes between the classes.

• When a new instance needs to be classified, we check on which side of the decision boundaries it falls, and classify accordingly (deterministic).

• This is called discriminative learning, because we defined an explicit boundary that discriminates between the different classes.

• The classification algorithms studied so far (except logistic regression) fall into this category.

Page 4:

Next: Generative Learning

• A different idea: use the data to build a model for each of the different classes

• To classify a new instance, we match it against the different models, then decide which model it resembles more.

• This is called generative learning

• Note that we now categorize whether an instance is more or less likely to come from a given model.

• Also note that you can use these models to generate new data.

Page 5:

Learning Bayesian Networks: Naïve and non-Naïve Bayes

• Hypothesis Space

– fixed size

– stochastic

– continuous parameters

• Learning Algorithm

– direct computation

– eager

– batch

Page 6:

But first… basic probability theory

• Random variables

• Distributions

• Statistical formulae

Page 7:

Random Variables

• A random variable is a random number (or value) determined by chance, or more formally, drawn according to a probability distribution
  – The probability distribution can be estimated from observed data (e.g., throwing dice)
  – The probability distribution can be synthetic
  – Random variables can be discrete or continuous

• Typical random variables in Machine Learning problems
  – The input data
  – The output data
  – Noise

• Important concept in learning: the data generating model
  – E.g., what is the data generating model for:
    i) throwing dice
    ii) regression
    iii) classification
    iv) visual perception

Page 8:

Why have Random Variables?

• Our goal is to predict our target variable

• We are not given the true (presumably deterministic) function

• We are only given observations
  – Can observe the number of times a die lands on 4
  – Can estimate the probability, given the input, that the die will land on 4
  – But we don't know where the die will land
  – Can only make a guess at the most likely value of the die, given the input.

Page 9:

Distributions

• The random variables only take on discrete values
  – E.g., throwing dice: possible values vi ∈ {1, 2, 3, 4, 5, 6}

• The probabilities sum to 1: Σi P(vi) = 1

• Discrete distributions are particularly important in classification

• Probability Mass Function or Frequency Function (normalized histogram)

(figure: normalized histogram of a "non-fair" die, with bar heights between 0 and 0.3)

Page 10:

Classic Discrete Distributions (I)

Bernoulli Distribution

• A Bernoulli random variable takes on only two values, i.e., 0 and 1.

• P(1) = p and P(0) = 1 - p, or in compact notation: P(x) = p^x (1 - p)^(1-x) for x ∈ {0, 1}

• The performance of a fixed number of trials with fixed probability of success (p) on each trial is known as a Bernoulli trial.

(figure: bar plot of P(x) for p = 0.3, i.e., P(0) = 0.7 and P(1) = 0.3)

Page 11:

Classic Discrete Distributions (II)

Binomial Distribution

• Like the Bernoulli distribution: binary input variables, 0 or 1, with probability P(1) = p and P(0) = 1 - p

• What is the probability of k successes, P(k), in a series of n independent trials? (n >= k)

• P(k) is a binomial random variable:

    P(k) = C(n, k) p^k (1 - p)^(n-k),  where C(n, k) = n! / (k! (n - k)!)

• The Bernoulli distribution is a special case of the binomial distribution (i.e., n = 1)
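As a concrete check of the formula above, here is a minimal Python sketch (not from the slides) that evaluates the binomial probability mass function; the n = 20, p = 0.25 setting is an assumption chosen to match what one of the plots on the next slide appears to show.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k) = C(n, k) * p^k * (1 - p)^(n - k): probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Assumed setting n = 20, p = 0.25, roughly matching one of the plots on the following slide.
print([round(binomial_pmf(k, 20, 0.25), 3) for k in range(8)])
print(sum(binomial_pmf(k, 20, 0.25) for k in range(21)))  # the probabilities sum to 1
```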

Page 12:

Classic Discrete Distributions (II)

Binomial Distribution

(figures: binomial probability mass functions P(k) for k = 0, …, 20, with p = 0.25, p = 0.1, and p = 0.5)

Page 13:

Classic Discrete Distributions (III)

Multinomial Distribution

• A generalization of the binomial distribution to multiple outcomes (i.e., multiple classes can be categorized instead of just one class)

• n independent trials can result in one of r types of outcomes, where each outcome cr has probability P(cr) = pr (with Σr pr = 1).

• What is the probability P(n1, n2, …, nr), i.e., the probability that in n trials the frequency of the r classes is (n1, n2, …, nr)? This is a multinomial random variable:

    P(n1, n2, …, nr) = [n! / (n1! n2! … nr!)] p1^n1 p2^n2 … pr^nr,  where n1 + n2 + … + nr = n

Page 14:

Continuous Probability Distributions

• The random variables take on real values.

• Continuous distributions can be viewed as the limit of discrete distributions as the number of discrete values goes to infinity while the probability of each individual value goes to zero.

• Probabilities become densities.

• The probability density integrates to 1: ∫ p(x) dx = 1.

Page 15:

Continuous Probability Distributions (cont’d)

• Probability Density Function p(x)

• Probability of an event: P(a ≤ x ≤ b) = ∫ from a to b of p(x) dx

Page 16:

Classic Continuous Distributions (I)

Normal Distribution

• The most used distribution

• Also called the Gaussian distribution, after C. F. Gauss who proposed it

• Justified by the Central Limit Theorem:

– Roughly: “if a random variable is the sum of a large number of independent random variables it is approximately normally distributed”

– Many observed variables are the sum of several random variables

• Shorthand: x ~ N(μ, σ²)

Page 17:

Classic Continuous Distributions (II)

Uniform Distribution

• All data is equally probable within a bounded region R: p(x) = 1/|R| for x in R, and 0 otherwise

Page 18:

Expected Value

• The expected value, mean or average of the random variable x is defined by:

    E[x] = μ = Σx x P(x)   (or ∫ x p(x) dx for a continuous variable)

• More generally, if f(x) is any function of x, the expected value of f is defined by:

    E[f(x)] = Σx f(x) P(x)

• This is also called the center of mass.

• Note that forming an expected value is linear, in that if a1 and a2 are arbitrary constants, then we have

    E[a1 f1(x) + a2 f2(x)] = a1 E[f1(x)] + a2 E[f2(x)]

Page 19:

Expected Value

• General rules of thumb:

• In general:

• Given a FINITE data sample, the expectation is estimated by the sample mean: E[x] ≈ (1/N) Σ_{i=1..N} xi

Page 20:

Variance and Standard Deviation

• Variance: Var(x) = σ² = E[(x - μ)²]

• σ is the standard deviation of x. The variance is never negative and approaches 0 as the probability mass is centered at one point.

• The standard deviation is a simple measure of how far values of x are likely to depart from the mean.

  – i.e., the standard or typical amount one should expect a randomly drawn value of x to deviate or differ from μ.

Page 21:

Sample variance and covariance

• Sample Variance: s² = [1/(N-1)] Σ_{i=1..N} (xi - x̄)²

  – Why division by (N-1)? This is to obtain an unbiased estimate of the variance (unbiased estimate: E[s²] = σ²).

• Covariance: Cov(x, y) = E[(x - μx)(y - μy)]

• Sample Covariance: [1/(N-1)] Σ_{i=1..N} (xi - x̄)(yi - ȳ)

Page 22:

Biased vs. Unbiased variance

• Biased variance: V = (1/n) Σ_{i=1..n} (xi - X̄)²

• "Anti-biased" (unbiased) variance: V* = [1/(n-1)] Σ_{i=1..n} (xi - X̄)²

• The two estimates are related by (n - 1) V* = n V.
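A quick numerical sketch (not from the slides) of the relation between the two estimators, using numpy's ddof argument to switch between division by n and by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=10)   # a small sample; the true variance is 4

v_biased   = np.var(x, ddof=0)   # divides by n     (the biased estimate V)
v_unbiased = np.var(x, ddof=1)   # divides by n - 1 (the unbiased estimate V*)

n = len(x)
print(v_biased, v_unbiased)
print(np.isclose((n - 1) * v_unbiased, n * v_biased))   # True: (n - 1) V* = n V
```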

Page 23:

Sample variances: s² is the sample variance, σ² is the true variance, μ is the true mean

Page 24:

Conditional Probability

• P(x|y) is the probability of the occurrence of event x given that y occurred and is given as: P(x|y) = P(x, y) / P(y)

• Knowing that y occurred reduces the sample space to y, and the part of it where x also occurred is (x,y)

Page 25:

Conditional Probability

• P(x|y) is the probability of the occurrence of event x given that y occurred and is given as: P(x|y) = P(x, y) / P(y)

• Knowing that y occurred reduces the sample space to y, and the part of it where x also occurred is (x,y)

• This is only defined if P(y) > 0. Also, because the joint event (x, y) is commutative, we have:

    P(x|y) P(y) = P(x, y) = P(y|x) P(x)

Page 26:

Statistical Independence

• If x and y are independent then we have P(x, y) = P(x) P(y)

• From there it follows that P(x|y) = P(x) and P(y|x) = P(y)

• In other words, knowing that y occurred does not change the probability that x occurs (and vice versa).

Page 27:

Bayes Rule

• Remember: P(x, y) = P(x|y) P(y) = P(y|x) P(x)

• Bayes Rule: P(y|x) = P(x|y) P(y) / P(x)

• Interpretation
  – P(y) is the PRIOR knowledge about y
  – x is new evidence to be incorporated to update my belief about y
  – P(x|y) is the LIKELIHOOD of x given that y was observed
  – Both prior and likelihood can often be generated beforehand, e.g., by histogram statistics
  – P(x) is a normalizing factor, corresponding to the marginal distribution of x. Often it need not be evaluated explicitly, but it can become a great computational burden. "P(x) is an enumeration of all possible combinations of x, and the probability of their occurrence."
  – P(y|x) is the POSTERIOR probability of y, i.e., the belief in y after one observes x.

Page 28:

Learning Bayesian Networks: Naïve and non-Naïve Bayes

• Hypothesis Space

– fixed size

– stochastic

– continuous parameters

• Learning Algorithm

– direct computation

– eager

– batch

Page 29:

Roles for Bayesian Methods

Provides practical learning algorithms:

• Naive Bayes learning

• Bayesian belief network learning

• Combine prior knowledge (prior probabilities) with observed data

• Requires prior probabilities

Provides useful conceptual framework

• Provides “gold standard” for evaluating other learning algorithms

Page 30:

Bayes Theorem

• Consider hypothesis space H

• P(h) = prior prob. of hypothesis h ∈ H

• P(D) = prior prob. of training data D

• P(h|D) = probability of h given D

• P(D|h) = probability of D given h

• Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)

Page 31:

Choosing Hypotheses

Natural choice is the most probable hypothesis given the training data, or maximum a posteriori hypothesis hMAP:

    hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

If we assume P(hi) = P(hj) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis:

    hML = argmax_{h ∈ H} P(D|h)

Page 32:

Bayes Theorem: Example

Does patient have cancer or not?
– A patient takes a lab test and the result comes back positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = 0.008        P(¬cancer) = 0.992

P(+|cancer) = 0.98       P(-|cancer) = 0.02

P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97

Page 33:

Bayes Theorem: Example

P(cancer) = 0.008        P(¬cancer) = 0.992

P(+|cancer) = 0.98       P(-|cancer) = 0.02

P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97

A positive test result comes in for a patient.

What is the hMAP?

P(+|cancer) P(cancer) = (0.98)(0.008) = .0078

P(+|¬cancer) P(¬cancer) = (0.03)(0.992) = .0298

hMAP = ¬cancer
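The arithmetic on this slide can be reproduced directly; the short sketch below (not from the slides, using only the numbers given above) also normalizes the two quantities to obtain the posterior probability of cancer given the positive test.

```python
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors P(+|h) P(h), as on the slide.
u_cancer = p_pos_given_cancer * p_cancer          # 0.00784
u_not    = p_pos_given_not * p_not_cancer         # 0.02976
print(u_cancer, u_not)                            # hMAP = not cancer

# Normalizing gives the posterior probability of cancer after the positive test.
print(u_cancer / (u_cancer + u_not))              # about 0.21
```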

Page 34:

Brute Force MAP Learner

1. For each hypothesis h in H, calculate the posterior probability

    P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability:

    hMAP = argmax_{h ∈ H} P(h|D)

Page 35:

Evolution of Posterior Probs

• As data is added, certainty of hypotheses increases.

Page 36:

Classifying New Instances

So far we've sought the most probable hypothesis given the data D (i.e., hMAP)

Given new instance x, what is its most probable classification?

hMAP(x) is not the most probable classification!

Page 37:

Classification Example

Consider:

• Three possible hypotheses:

P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3

• Given new instance x,

h1(x) = +, h2(x) = -, h3(x) = -

• What’s hMAP(x) ?

• What's most probable classification of x?

Page 38:

Bayes Optimal Classifier

Bayes optimal classification:

    v = argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)

Example:

• P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1

• P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0

• P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0

therefore

    Σ_hi P(+|hi) P(hi|D) = .4   and   Σ_hi P(-|hi) P(hi|D) = .6

so the Bayes optimal classification is -, which differs from the prediction of the MAP class (hMAP = h1 predicts +).
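A small sketch (not from the slides) of the weighted vote above: each hypothesis contributes its posterior P(h|D) to the class it predicts, since P(v|h) is either 0 or 1 in this example.

```python
posteriors  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # h(x) for the new instance x

votes = {"+": 0.0, "-": 0.0}
for h, p_h in posteriors.items():
    votes[predictions[h]] += p_h                  # sum_h P(v|h) P(h|D)

print(votes)                                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))                  # '-', even though hMAP = h1 predicts '+'
```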

Page 39:

Gibbs Classifier

Bayes optimal classifier provides best result, but can be expensive if many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D)

2. Use this one to classify new instance

Page 40:

Error of Gibbs

Noteworthy fact [Haussler 1994]: Assume target concepts are drawn at random from H according to priors on H. Then:

E [errorGibbs] ≤ 2E [errorBayesOptimal]

Suppose a correct, uniform prior distribution over H. Then:

• Pick any hypothesis consistent with the data, with uniform probability

• Its expected error no worse than twice Bayes optimal

Page 41:

Naive Bayes Classifier

Along with decision trees, neural networks, kNN, one of the most practical and most used learning methods.

When to use:

• Moderate or large training set available

• Attributes that describe instances are conditionally independent given classification

Successful applications:

• Diagnosis

• Classifying text documents

Page 42:

Naïve Bayes Assumption

• Suppose the features xi are discrete

• Assume the xi are conditionally independent given y.

• In other words, assume that: P(x1, …, xn | y) = Πi P(xi | y)

• Then we have: P(y | x1, …, xn) ∝ P(y) Πi P(xi | y)

• For binary features, instead of O(2^n) numbers to describe a model, we only need O(n)!

Page 43:

Graphical Representation of Naïve Bayes Model

• Each node contains a probability table
  – y: P(y = k)
  – xj: P(xj = v | y = k) "class conditional probability"

• Interpret as a generative model
  – Choose the class k according to P(y = k)
  – Generate each feature independently according to P(xj = v | y = k)
  – The feature values are conditionally independent: P(xi, xj | y) = P(xi | y) · P(xj | y)

(figure: class node y with arrows to feature nodes x1, x2, x3, …, xn)

Page 44:

Naïve Bayes Algorithm

Naïve_Bayes_Learn(examples)

  For each target value yj: estimate P(yj)

  For each attribute value xi of each attribute x: estimate P(xi | yj)

Classify_New_Instance(x)

  yNB = argmax_{yj} P(yj) Π_{xi ∈ x} P(xi | yj)
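A minimal sketch of the two procedures above for discrete features, assuming examples are given as (feature-dict, label) pairs. This is an illustration, not the course's reference code; it uses raw frequency estimates with a small floor to avoid log(0) (the Laplace correction discussed later is the cleaner fix).

```python
import math
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (features, label) with features a dict {attribute: value}.
    Returns class counts and per-(attribute, class) value counts."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)              # (attribute, label) -> Counter over values
    for features, label in examples:
        for attr, value in features.items():
            value_counts[(attr, label)][value] += 1
    return class_counts, value_counts

def classify_new_instance(x, class_counts, value_counts):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for attr, value in x.items():
            score += math.log(max(value_counts[(attr, label)][value], 1e-9) / n_label)
        scores[label] = score
    return max(scores, key=scores.get)
```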

Page 45:

Naïve Bayes: Example

• Consider the PlayTennis problem and new instance

<Outlook = sun, Temp = cool, Humid = high, Wind = strong>

Want to compute:

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005

P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021

• So, yNB = n

Page 46:

Naïve Bayes: Subtleties

• Conditional independence assumption is often violated

P(x1, x2, …, xn | yj) = Πi P(xi | yj)

• ...but it works surprisingly well anyway. Note we don't need the estimated posteriors P(yj|x) to be correct; we need only that

    argmax_{yj} P̂(yj) Πi P̂(xi | yj) = argmax_{yj} P(yj) P(x1, …, xn | yj)

• See Domingos & Pazzani [1996] for analysis

• Naïve Bayes posteriors often unrealistically close to 1 or 0

Page 47:

Decision Boundary of naïve Bayes with binary features

• The parameters of the model are θ_{i,1} = P(xi = 1 | y = 1), θ_{i,0} = P(xi = 1 | y = 0), and π = P(y = 1)

• What is the decision surface?

• Using the log trick, we get the log posterior odds:

    log [P(y=1|x) / P(y=0|x)] = log [π / (1 - π)] + Σi { xi log [θ_{i,1} / θ_{i,0}] + (1 - xi) log [(1 - θ_{i,1}) / (1 - θ_{i,0})] }

  and we predict y = 1 whenever this quantity is positive.

• Note that in the equation above, each xi is 1 or 0, depending on whether the corresponding feature value is present in the instance.

Page 48:

Decision Boundary of naïve Bayes with binary features

let:

    w0 = log [π / (1 - π)] + Σi log [(1 - θ_{i,1}) / (1 - θ_{i,0})]
    wi = log [θ_{i,1} / θ_{i,0}] - log [(1 - θ_{i,1}) / (1 - θ_{i,0})]

• We can re-write the decision boundary as:

    w0 + Σi wi xi = 0

• This is a linear decision boundary!
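A small numerical check (not from the slides, with made-up parameter values) that the naive Bayes log-odds and the linear form w0 + Σ wi xi agree.

```python
import numpy as np

theta1 = np.array([0.8, 0.3, 0.6])   # hypothetical P(x_i = 1 | y = 1)
theta0 = np.array([0.2, 0.5, 0.4])   # hypothetical P(x_i = 1 | y = 0)
pi = 0.4                             # hypothetical P(y = 1)

def log_odds(x):
    """log P(y=1|x) - log P(y=0|x) computed directly from the naive Bayes model."""
    l1 = np.log(pi)     + np.sum(x * np.log(theta1) + (1 - x) * np.log(1 - theta1))
    l0 = np.log(1 - pi) + np.sum(x * np.log(theta0) + (1 - x) * np.log(1 - theta0))
    return l1 - l0

w  = np.log(theta1 / theta0) - np.log((1 - theta1) / (1 - theta0))
w0 = np.log(pi / (1 - pi)) + np.sum(np.log((1 - theta1) / (1 - theta0)))

x = np.array([1, 0, 1])
print(log_odds(x), w0 + w @ x)   # the two numbers agree
```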

Page 49:

Representing P(xj|y)

Many representations are possible
  – Univariate Gaussian
    • if xj is a continuous random variable, then we can use a normal distribution and learn the mean μ and variance σ²
  – Multinomial/Binomial
    • if xj is a discrete random variable, xj ∈ {v1, …, vm}, then we construct a conditional probability table
  – Discretization
    • convert continuous xj into a discrete variable
  – Kernel Density Estimates
    • apply a kind of nearest-neighbor algorithm to compute P(xj | y) in the neighborhood of the query point

Page 50:

Representing P(xj|y) – Discrete Values

– Multinomial/Binomial
  • if xj is a discrete random variable, xj ∈ {v1, …, vm}, then we construct the conditional probability table:

              y = 1             y = 2             …    y = K
    xj = v1   P(xj=v1 | y=1)    P(xj=v1 | y=2)    …    P(xj=v1 | y=K)
    xj = v2   P(xj=v2 | y=1)    P(xj=v2 | y=2)    …    P(xj=v2 | y=K)
    …         …                 …                 …    …
    xj = vm   P(xj=vm | y=1)    P(xj=vm | y=2)    …    P(xj=vm | y=K)

Page 51:

Discretization via Mutual Information

• Many discretization algorithms have been studied. One of the best is based on mutual information [Fayyad & Irani 93].
  – To discretize feature xj, grow a decision tree considering only splits on xj. Each leaf of the resulting tree will correspond to a single value of the discretized xj.

Page 52:

Discretization via Mutual Information

• Many discretization algorithms have been studied. One of the best is based on mutual information [Fayyad & Irani 93].
  – To discretize feature xj, grow a decision tree considering only splits on xj. Each leaf of the resulting tree will correspond to a single value of the discretized xj.

  – Stopping rule (applied at each node). Stop when

      I(xj; y) < log2(N - 1)/N + Δ/N,   where   Δ = log2(3^K - 2) - [K·H(S) - Kl·H(Sl) - Kr·H(Sr)]

  – where S is the training data in the parent node; Sl and Sr are the examples in the left and right child. K, Kl, and Kr are the corresponding numbers of classes present in these examples. I is the mutual information, H is the entropy, and N is the number of examples in the node.
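The stopping test can be written down directly. The sketch below (not from the slides) is one reading of the formula above, interpreting I(xj; y) as the information gain of the candidate split.

```python
import numpy as np

def entropy(labels):
    """Entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def should_stop(parent, left, right):
    """MDL stopping rule: stop splitting when the gain falls below the threshold above."""
    N = len(parent)
    K, Kl, Kr = (len(np.unique(s)) for s in (parent, left, right))
    gain = entropy(parent) - (len(left) / N) * entropy(left) - (len(right) / N) * entropy(right)
    delta = np.log2(3**K - 2) - (K * entropy(parent) - Kl * entropy(left) - Kr * entropy(right))
    return gain < (np.log2(N - 1) + delta) / N
```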

Page 53:

Discretization: Thermometer Encoding

• Many discretization algorithms have been studied. One of the best is based on mutual information [Fayyad & Irani 93].
  – An alternative encoding [Macskassy et al. 02] is similar to thermometer coding as used in neural networks [Gallant 93].
  – Rather than encode continuous values into one value, encode into 2v values, where v is the number of split points created.
  – Each encoded binary value represents whether the continuous value was greater than, or less-than-or-equal-to, one of the split points.

(figure: a value p lying between split1 and split2 on a number line with four split points)

Encoding: {>split1, ≤split1, >split2, ≤split2, >split3, ≤split3, >split4, ≤split4}

p: { 1, 0, 0, 1, 0, 1, 0, 1 }
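A sketch (not from the slides) of the encoding described above; with four assumed split points and a value lying between the first and second, it reproduces the bit pattern shown for p.

```python
def thermometer_encode(value, splits):
    """Encode a continuous value into 2*len(splits) binary features:
    for each split point, one bit for '> split' and one for '<= split'."""
    bits = []
    for s in splits:
        gt = 1 if value > s else 0
        bits.extend([gt, 1 - gt])
    return bits

# A value between split1 and split2, as in the slide's illustration (split points are assumed).
print(thermometer_encode(1.5, [1.0, 2.0, 3.0, 4.0]))   # [1, 0, 0, 1, 0, 1, 0, 1]
```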

Page 54:

Kernel Density Estimators

• Define K(xj, xi,j) to be the Gaussian kernel with parameter σ:

    K(xj, xi,j) = [1 / (√(2π) σ)] exp( -((xj - xi,j) / σ)² )

• Estimate

    P(xj | y = k) = [ Σ_{i: yi = k} K(xj, xi,j) ] / Nk

  where Nk is the number of training examples in class k.
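A minimal numpy sketch (not from the slides) of the estimate above, written with the kernel exactly as shown on the slide.

```python
import numpy as np

def kde_class_conditional(xq, xs_class_k, sigma):
    """Estimate P(xj = xq | y = k) from the feature values xs_class_k observed in class k."""
    k = np.exp(-((xq - xs_class_k) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return k.sum() / len(xs_class_k)

xs_class_k = np.array([1.0, 1.2, 3.5, 3.7, 3.9])   # hypothetical training values for one feature
print(kde_class_conditional(3.6, xs_class_k, sigma=0.5))
```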

Page 55:

Kernel Density Estimators (2)

• This is equivalent to placing a Gaussian “bump” of height 1/Nk on each training data point from class k and then adding them up

(figure: P(xj|y) plotted against xj as a sum of Gaussian bumps)

Page 56:

Kernel Density Estimators (3)

• Resulting probability density

(figure: the resulting density P(xj|y) plotted against xj)

Page 57:

The value chosen for σ is critical

(figures: kernel density estimates for σ = 0.15 and σ = 0.50)

Page 58:

Learning the Probability Distributions by Direct Computation

• P(y=k) is just the fraction of training examples belonging to class k.

• For multinomial variables, P(xj = v | y = k) is the fraction of training examples in class k where xj = v

• For Gaussian variables, μ̂jk is the average value of xj for the training examples in class k, and σ̂jk is the sample standard deviation of those points:

    σ̂jk = sqrt[ (1/Nk) Σ_{i: yi = k} (xi,j - μ̂jk)² ]
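A compact sketch (not from the slides) of the direct computation for Gaussian features: class priors, per-class means, and per-class standard deviations, with the standard deviation dividing by Nk as in the formula above.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """X: (N, d) array of continuous features, y: (N,) array of class labels."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "prior": len(Xk) / len(X),    # P(y = k)
            "mean":  Xk.mean(axis=0),     # mu_hat_jk for every feature j
            "std":   Xk.std(axis=0),      # sigma_hat_jk (divides by N_k, as above)
        }
    return params
```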

Page 59:

Improved Probability Estimates via Laplace Corrections

• When we have very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are “too strong” and cause problems

• Suppose we are estimating a probability P(z) and we have n0 examples where z is false and n1 examples where z is true. Our direct estimate is

    P(z = 1) = n1 / (n0 + n1)

• Laplace Estimate. Add 1 to the numerator and 2 to the denominator:

    P(z = 1) = (n1 + 1) / (n0 + n1 + 2)

  This says that in the absence of any evidence, we expect P(z) = 0.5, but our belief is weak (equivalent to 1 example for each outcome).

• Generalized Laplace Estimate. If z has K different outcomes, then we estimate it as

    P(z = 1) = (n1 + 1) / (n0 + ⋯ + nK-1 + K)
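A one-function sketch (not from the slides) of the generalized Laplace estimate: add 1 to each outcome's count and K to the denominator.

```python
def laplace_estimate(counts):
    """counts: list of outcome counts [n_0, ..., n_{K-1}]. Returns smoothed probabilities."""
    K = len(counts)
    total = sum(counts) + K
    return [(c + 1) / total for c in counts]

print(laplace_estimate([0, 3]))   # [0.2, 0.8]: an unseen outcome still gets nonzero probability
```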

Page 60:

Naïve Bayes Applied to Diabetes Diagnosis

• Bayes nets and causality
  – Bayes nets work best when arrows follow the direction of causality
    • two things with a common cause are likely to be conditionally independent given the cause; arrows in the causal direction capture this independence
  – In a Naïve Bayes network, arrows are often not in the causal direction
    • diabetes does not cause pregnancies
    • diabetes does not cause age
  – But some arrows are correct
    • diabetes does cause the level of blood insulin and blood glucose

Page 61:

Non-Naïve Bayes

• Manually construct a graph in which all arcs are causal

• Learning the probability tables is still easy. For example, P(Mass | Age, Preg) involves counting the number of patients of a given age and number of pregnancies that have a given body mass

• Classification:

P(D = d | A, P, M, I, G) = P(I | D = d) P(G | I, D = d) P(D = d | A, M, P) / P(I, G)

Page 62:

Bayesian Belief Network

Network represents a set of conditional ind. assertions:

• Each node is asserted to be conditionally ind. of its nondescendants, given its immediate predecessors.

• Directed acyclic graph

Page 63:

Bayesian Belief Network

Represents joint probability distribution over all variables

• e.g., P(Storm, BusTourGroup, …, ForestFire)

• in general,

    P(y1, …, yn) = Π_{i=1..n} P(yi | Parents(Yi))

  where Parents(Yi) denotes the immediate predecessors of Yi in the graph

• Therefore, the joint distribution is fully defined by the graph, plus the CPTs (conditional probability tables)
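To make the factorization concrete, here is a toy sketch (not the slide's full Storm/BusTourGroup network; the structure and CPT numbers are assumptions) with a three-node chain. The joint is just the product of each node's CPT entry given its parents.

```python
# Toy chain: Storm -> Lightning -> Thunder (a simplified stand-in for the slide's network).
P_storm = {True: 0.1, False: 0.9}
P_lightning = {True: {True: 0.7, False: 0.3},      # P(Lightning | Storm=True)
               False: {True: 0.01, False: 0.99}}   # P(Lightning | Storm=False)
P_thunder = {True: {True: 0.95, False: 0.05},      # P(Thunder | Lightning=True)
             False: {True: 0.02, False: 0.98}}     # P(Thunder | Lightning=False)

def joint(storm, lightning, thunder):
    """P(Storm, Lightning, Thunder) = P(Storm) P(Lightning|Storm) P(Thunder|Lightning)."""
    return P_storm[storm] * P_lightning[storm][lightning] * P_thunder[lightning][thunder]

print(joint(True, True, True))   # 0.1 * 0.7 * 0.95
```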

Page 64:

Inference in Bayesian Nets

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?

• Bayes net contains all information needed for this inference

• If only one variable with unknown value, easy to infer it

• Easy if BN is a “polytree”

• In general case, problem is NP hard (#P-complete, Roth 1996).

Page 65:

Inference in Practice

In practice, can succeed in many cases

• Exact inference methods work well for some network structures (small “induced width”)

• Monte Carlo methods “simulate” the network randomly to calculate approximate solutions

• Now used as a primitive in more advanced learning and reasoning scenarios. (e.g., in relational learning)

Page 66:

Learning Bayes Nets

Suppose structure known, variables partially observable

e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...

Similar to training neural network with hidden units

• In fact, can learn network conditional probability tables using gradient ascent!

• Converge to network h that (locally) maximizes P(D|h)

Page 67:

Gradient Ascent for BNs

Let wijk denote one entry in the conditional probability table for variable Yi in the network:

    wijk = P(Yi = yij | Parents(Yi) = the list uik of values)

e.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F>

Perform gradient ascent by repeatedly:

1. Update all wijk using training data D:

    wijk ← wijk + η Σ_{d ∈ D} P(yij, uik | d) / wijk

2. Then, renormalize the wijk to assure Σj wijk = 1 and 0 ≤ wijk ≤ 1

Page 68:

Unknown Structure

When structure unknown...

• Algorithms use greedy search to add/subtract edges and nodes

• Active research topic

Somewhat like decision trees: searching for a discrete graph structure

Page 69:

Belief Networks

• Combine prior knowledge with observed data

• Impact of prior knowledge (when correct!) is to lower the sample complexity

• Active research area (UAI)
  – Extend from Boolean to real-valued variables

– Parameterized distributions instead of tables

– Extend to first-order systems

– More effective inference methods

– ...

Page 70:

Naïve Bayes Summary

• Advantages of Bayesian networks
  – Produces stochastic classifiers
    • can be combined with utility functions to make optimal decisions
  – Easy to incorporate causal knowledge
    • resulting probabilities are easy to interpret
  – Very simple learning algorithms
    • if all variables are observed in training data

• Disadvantages of Bayesian networks
  – Fixed sized hypothesis space
    • may underfit or overfit the data
    • may not contain any good classifiers if prior knowledge is wrong
  – Harder to handle continuous features

Page 71:

Evaluation of Naïve Bayes

Criterion                  LMS   Logistic   LDA   Trees   NNbr   Nets   NB
Mixed data                 no    no         no    yes     no     no     yes
Missing values             no    no         yes   yes     some   no     yes
Outliers                   no    yes        no    yes     yes    yes    disc
Monotone transformations   no    no         no    yes     no     some   disc
Scalability                yes   yes        yes   yes     no     yes    yes
Irrelevant inputs          no    no         no    some    no     no     some
Linear combinations        yes   yes        yes   no      some   yes    yes
Interpretable              yes   yes        yes   yes     no     no     yes
Accurate                   yes   yes        yes   no      no     yes    yes

• Naïve Bayes is very popular, particularly in natural language processing and information retrieval where there are many features compared to the number of examples

• In applications with lots of data, Naïve Bayes does not usually perform as well as more sophisticated methods