CMSC 471, Spring 2014
Class #16: Machine Learning II
Thursday, March 27, 2014
Professor Marie desJardins, [email protected]
Computing Information Gain
[Figure: the 12 restaurant training examples (6 Y, 6 N), partitioned by Type (French, Italian, Thai, Burger) and by Patrons (Empty, Some, Full)]
I(T) =
I(Pat, T) =
I(Type, T) =
Gain (Pat, T) =
Gain (Type, T) =
Gain(A, S) = I(S) − I(A, S) = I(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) × I(S_v)
Computing Information Gain
[Figure: the same 12 examples, partitioned by Type and by Patrons]
• I(T) = − (.5 log .5 + .5 log .5) = .5 + .5 = 1
• I(Pat, T) = 1/6 (0) + 1/3 (0) + 1/2 (− (2/3 log 2/3 + 1/3 log 1/3)) = 1/2 (2/3 × .6 + 1/3 × 1.6) = .47
• I(Type, T) = 1/6 (1) + 1/6 (1) + 1/3 (1) + 1/3 (1) = 1
• Gain(Pat, T) = 1 − .47 = .53
• Gain(Type, T) = 1 − 1 = 0
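The following is a minimal Python sketch (not part of the original slides) that reproduces these numbers with exact base-2 logs; the per-value (Y, N) counts are read off the figure:

```python
import math

def entropy(pos, neg):
    """Binary entropy I(p, n) in bits; 0 log 0 is treated as 0."""
    total = pos + neg
    bits = 0.0
    for k in (pos, neg):
        if k > 0:
            p = k / total
            bits -= p * math.log2(p)
    return bits

def gain(splits, pos, neg):
    """Gain(A, T) = I(T) - sum over values v of |T_v|/|T| * I(T_v)."""
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder

# (Y, N) counts per attribute value, read off the figure: 6 Y, 6 N overall.
patrons = [(0, 2), (4, 0), (2, 4)]          # Empty, Some, Full
cuisine = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(gain(patrons, 6, 6))  # ~0.54
print(gain(cuisine, 6, 6))  # 0.0
```

The exact value of Gain(Pat, T) is about .54; the slides' .53 comes from rounding the logs to .6 and 1.6.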
Using Gain Ratios
• The information gain criterion favors attributes that have a large number of values
– If we have an attribute D that has a distinct value for each record, then I(D,T) is 0, thus Gain(D,T) is maximal
• To compensate for this, Quinlan suggests using the following ratio instead of Gain:
GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
• SplitInfo(D,T) is the information due to the split of T on the basis of value of categorical attribute D
SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
where {T1, T2, ..., Tm} is the partition of T induced by the value of D
Computing Gain Ratio
[Figure: the same 12 examples, partitioned by Type and by Patrons]
I(T) = 1
I (Pat, T) = .47
I (Type, T) = 1
Gain (Pat, T) = .53
Gain (Type, T) = 0
SplitInfo (Pat, T) =
SplitInfo (Type, T) =
GainRatio (Pat, T) = Gain (Pat, T) / SplitInfo(Pat, T) = .53 / ______ =
GainRatio (Type, T) = Gain (Type, T) / SplitInfo (Type, T) = 0 / ____ = 0 !!
SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)
Computing Gain Ratio
[Figure: the same 12 examples, partitioned by Type and by Patrons]
• I(T) = 1
• I(Pat, T) = .47
• I(Type, T) = 1
• Gain(Pat, T) = .53
• Gain(Type, T) = 0
• SplitInfo(Pat, T) = − (1/6 log 1/6 + 1/3 log 1/3 + 1/2 log 1/2) = 1/6 × 2.6 + 1/3 × 1.6 + 1/2 × 1 = 1.47
• SplitInfo(Type, T) = − (1/6 log 1/6 + 1/6 log 1/6 + 1/3 log 1/3 + 1/3 log 1/3) = 1/6 × 2.6 + 1/6 × 2.6 + 1/3 × 1.6 + 1/3 × 1.6 = 1.93
• GainRatio(Pat, T) = Gain(Pat, T) / SplitInfo(Pat, T) = .53 / 1.47 = .36
• GainRatio(Type, T) = Gain(Type, T) / SplitInfo(Type, T) = 0 / 1.93 = 0
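Here is a companion Python sketch (again, not from the slides) for the split-information and gain-ratio computations above; the value-group sizes are read off the figure:

```python
import math

def split_info(sizes):
    """SplitInfo(D, T) = I(|T1|/|T|, ..., |Tm|/|T|)."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """GainRatio(D, T) = Gain(D, T) / SplitInfo(D, T)."""
    return gain / split_info(sizes)

print(split_info([2, 4, 6]))          # ~1.46 for Patrons (slides: 1.47)
print(split_info([2, 2, 4, 4]))       # ~1.92 for Type (slides: 1.93)
print(gain_ratio(0.53, [2, 4, 6]))    # ~0.36
print(gain_ratio(0.0, [2, 2, 4, 4]))  # 0.0
```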
Bayesian Learning
Chapter 20.1-20.2
Some material adapted from lecture notes by Lise Getoor and Ron Parr
Naïve Bayes
Naïve Bayes
• Use Bayesian modeling
• Make the simplest possible independence assumption:
– Each attribute is independent of the values of the other attributes, given the class variable
– In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not)
Bayesian Formulation
• p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
• Assume that each feature Fi is conditionally independent of the other features given the class C. Then:
p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
• We can estimate each of these conditional probabilities from the observed counts in the training data:
p(Fi | C) = N(Fi ∧ C) / N(C)
– One subtlety of using the algorithm in practice: when your estimated probabilities are zero, ugly things happen
– The fix: add one to every count (aka "Laplacian smoothing"; they have a different name for everything!)
Naive Bayes: Example
• p(Wait | Cuisine, Patrons, Rainy?) = α p(Wait) p(Cuisine ∧ Patrons ∧ Rainy? | Wait)
= α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)
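As an illustration, here is a minimal naive Bayes sketch in Python with add-one (Laplacian) smoothing. The training data and the train/predict helpers are hypothetical, just echoing the restaurant features above:

```python
from collections import Counter, defaultdict

def train(examples, labels):
    """Estimate counts for p(C) and p(F_i = v | C) from the training data."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # feature_counts[i][(v, c)] = N(F_i = v AND C = c)
    values = defaultdict(set)              # values[i] = set of observed values of F_i
    for x, c in zip(examples, labels):
        for i, v in enumerate(x):
            feature_counts[i][(v, c)] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict(x, class_counts, feature_counts, values):
    """Return argmax_c p(c) * prod_i p(x_i | c); alpha cancels in the argmax."""
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n
        for i, v in enumerate(x):
            # Add-one smoothing: an unseen (value, class) pair never zeroes the product.
            score *= (feature_counts[i][(v, c)] + 1) / (nc + len(values[i]))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data: (Cuisine, Patrons, Rainy?) -> Wait
X = [("Thai", "Full", "Yes"), ("French", "Some", "No"),
     ("Burger", "Some", "No"), ("Thai", "Empty", "Yes")]
y = ["No", "Yes", "Yes", "No"]
model = train(X, y)
print(predict(("Thai", "Some", "No"), *model))  # -> "Yes" on this toy data
```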
Naive Bayes: Analysis
• The naive Bayes assumption: is it reasonable?
• Naive Bayes is amazingly easy to implement (once you understand the bit of math behind it)
• Remarkably, naive Bayes can outperform many much more complex algorithms—it’s a baseline that should pretty much always be used for comparison
• Naive Bayes can’t capture interdependencies between variables (obviously)—for that, we need Bayes nets!
Learning Bayesian Networks
Bayesian Learning: Bayes’ Rule
• Given some model space (set of hypotheses hi) and evidence (data D):
– P(hi | D) = α P(D | hi) P(hi)
• We assume that observations are independent of each other, given a model (hypothesis), so:
– P(hi | D) = α Πj P(dj | hi) P(hi)
• To predict the value of some unknown quantity X (e.g., the class label for a future observation):
– P(X | D) = Σi P(X | D, hi) P(hi | D) = Σi P(X | hi) P(hi | D)
(P(X | D, hi) = P(X | hi) by our independence assumption: given the hypothesis, X does not depend on D.)
Bayesian Learning
• We can apply Bayesian learning in three basic ways:
– BMA (Bayesian Model Averaging): Don’t just choose one hypothesis; instead, make predictions based on the weighted average of all hypotheses (or some set of best hypotheses)
– MAP (Maximum A Posteriori) hypothesis: Choose the hypothesis with the highest a posteriori probability, given the data
– MLE (Maximum Likelihood Estimate): Assume that all hypotheses are equally likely a priori; then the best hypothesis is just the one that maximizes the likelihood (i.e., the probability of the data given the hypothesis)
• MDL (Minimum Description Length) principle: Use some encoding to model the complexity of the hypothesis, and the fit of the data to the hypothesis, then minimize the overall description of hi + D
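To make the BMA/MAP/MLE distinction concrete, here is a toy Python sketch with three hypothetical hypotheses over a binary outcome X; the priors and likelihood values are made up for illustration:

```python
priors = {"h1": 0.6, "h2": 0.3, "h3": 0.1}  # P(h_i)
p_x = {"h1": 0.2, "h2": 0.5, "h3": 0.9}     # P(X = 1 | h_i)

def likelihood(h, data):
    """P(D | h) = prod_j P(d_j | h), assuming i.i.d. observations."""
    l = 1.0
    for d in data:
        l *= p_x[h] if d == 1 else 1 - p_x[h]
    return l

data = [1, 1, 0, 1]

# Posteriors P(h | D) proportional to P(D | h) P(h), then normalized.
post = {h: likelihood(h, data) * priors[h] for h in priors}
z = sum(post.values())
post = {h: p / z for h, p in post.items()}

# BMA: predict with the weighted average of all hypotheses.
bma = sum(p_x[h] * post[h] for h in post)

# MAP: the single hypothesis with the highest posterior.
h_map = max(post, key=post.get)

# MLE: uniform prior, so just the highest likelihood.
h_mle = max(priors, key=lambda h: likelihood(h, data))

print("posteriors:", post)
print("BMA P(X=1 | D):", round(bma, 3))
print("MAP hypothesis:", h_map)   # h2 here
print("MLE hypothesis:", h_mle)   # h3 here: MAP and MLE can disagree
```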
Learning Bayesian Networks
• Given training set D = {x[1], ..., x[M]}
• Find B that best matches D
– model selection
– parameter estimation
[Figure: data D, a table of complete records (E[1], B[1], A[1], C[1]) ... (E[M], B[M], A[M], C[M]), is fed to an Inducer, which outputs a network over B, E, A, C]
Parameter Estimation
• Assume known structure
• Goal: estimate BN parameters θ
– entries in local probability models, P(X | Parents(X))
• A parameterization θ is good if it is likely to generate the observed data (i.i.d. samples):
L(θ : D) = P(D | θ) = Πm P(x[m] | θ)
• Maximum Likelihood Estimation (MLE) Principle: Choose θ* so as to maximize L
Parameter Estimation II
• The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter
• The MLE (maximum likelihood estimate) solution:
– for each value x of a node X
– and each instantiation u of Parents(X):
θ*_{x|u} = N(x, u) / N(u)
– the counts N(x, u) are the sufficient statistics
– Just need to collect the counts for every combination of parents and children observed in the data
– MLE is equivalent to an assumption of a uniform prior over parameter values
Sufficient Statistics: Example
• Why are the counts sufficient?
[Figure: network with Earthquake and Burglary as parents of Alarm, and Moon-phase as parent of Light-level]
θ*_{x|u} = N(x, u) / N(u), so for example:
θ*_{A|E,B} = N(A, E, B) / N(E, B)
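A minimal Python sketch of this counting, with a hypothetical four-record data set over E, B, A (the helper name mle_cpt is ours, not from the slides):

```python
from collections import Counter

# Hypothetical complete data: each record fully assigns (E, B, A).
records = [
    {"E": 1, "B": 0, "A": 1},
    {"E": 1, "B": 0, "A": 0},
    {"E": 0, "B": 1, "A": 1},
    {"E": 0, "B": 0, "A": 0},
]

def mle_cpt(records, child, parents):
    """Estimate P(child | parents) as N(child, parents) / N(parents)."""
    joint = Counter()  # N(x, u)
    marg = Counter()   # N(u)
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(r[child], u)] += 1
        marg[u] += 1
    return {xu: n / marg[xu[1]] for xu, n in joint.items()}

# theta*_{A | E, B}: only the counts N(A, E, B) and N(E, B) are needed.
print(mle_cpt(records, "A", ["E", "B"]))
```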
Model Selection
Goal: Select the best network structure, given the data
Input:
– Training data
– Scoring function
Output:
– A network that maximizes the score
Structure Selection: Scoring
• Bayesian: prior over parameters and structure
– get balance between model complexity and fit to data as a byproduct
• Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)]
– P(D | G) is the marginal likelihood; P(G) is the prior on structure
• Marginal likelihood just comes from our parameter estimates
• Prior on structure can be any measure we want; typically a function of the network complexity
• Same key property: decomposability
Score(structure) = Σi Score(family of Xi)
Heuristic Search
[Figure: greedy local search over structures; from a current network over B, E, A, C, candidate moves include Add E→C (Δscore(C)), Delete E→A (Δscore(A)), and Reverse E→A (Δscore(A, E))]
Exploiting Decomposability
[Figure: the same candidate moves (Add E→C, Delete E→A, Reverse E→A); after a move, Δscores for families that did not change can be reused]
• To recompute scores, only need to re-score families that changed in the last move (see the sketch below)
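Below is a schematic Python sketch of greedy structure search that exploits decomposability; score_family stands in for any decomposable score, and for brevity it only toggles single edges (no reversal moves) and omits the acyclicity check a real implementation needs:

```python
def greedy_search(variables, score_family, data):
    """Hill-climb over structures; score(G) = sum of per-family scores."""
    parents = {x: set() for x in variables}
    fam = {x: score_family(x, parents[x], data) for x in variables}
    improved = True
    while improved:
        improved = False
        best = None  # (delta, child, candidate parent set)
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                cand = parents[x] ^ {y}  # toggle edge y -> x in or out
                # Only x's family changes, so one re-score gives the delta.
                delta = score_family(x, cand, data) - fam[x]
                if best is None or delta > best[0]:
                    best = (delta, x, cand)
        if best is not None and best[0] > 0:
            _, x, cand = best
            parents[x], fam[x] = cand, fam[x] + best[0]
            improved = True
    # Note: a real implementation must also reject moves that create cycles
    # and handle reversals, which change two families at once.
    return parents

# Smoke test with a dummy scorer that only rewards the family A <- {E}.
dummy = lambda x, ps, data: 1.0 if (x == "A" and ps == {"E"}) else 0.0
print(greedy_search(["E", "B", "A", "C"], dummy, data=None))
```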
Variations on a Theme
• Known structure, fully observable: only need to do parameter estimation
• Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
• Known structure, missing values: use expectation maximization (EM) to estimate parameters
• Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
• Unknown structure, hidden variables: too hard to solve!
Handling Missing Data
• Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
• Should we throw that data away??
• Idea: Guess the missing values based on the other data
[Figure: the network with Earthquake and Burglary as parents of Alarm, and Moon-phase as parent of Light-level]
EM (Expectation Maximization)
• Guess probabilities for nodes with missing values (e.g., based on other observations)
• Compute the probability distribution over the missing values, given our guess
• Update the probabilities based on the guessed values
• Repeat until convergence
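A minimal EM sketch in Python for the simplest version of this setup: a hypothetical two-node model B → A with a known CPT P(A | B), where we estimate θ = P(B = 1) from records that sometimes omit B:

```python
P_A_given_B = {1: 0.9, 0: 0.1}  # assumed known: P(A = 1 | B)

# Records: (B, A) pairs, with B possibly None (missing).
records = [(1, 1), (0, 0), (None, 1), (None, 0), (0, 0), (None, 1)]

theta = 0.5  # initial guess for P(B = 1)
for _ in range(50):  # repeat until convergence (fixed iterations here)
    expected_b = 0.0
    for b, a in records:
        if b is not None:
            expected_b += b
        else:
            # E-step: fill in P(B = 1 | A = a) via Bayes' rule under current theta.
            pa_b1 = P_A_given_B[1] if a == 1 else 1 - P_A_given_B[1]
            pa_b0 = P_A_given_B[0] if a == 1 else 1 - P_A_given_B[0]
            expected_b += theta * pa_b1 / (theta * pa_b1 + (1 - theta) * pa_b0)
    theta = expected_b / len(records)  # M-step: expected count / total
print(theta)
```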
EM Example
• Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
• We estimate the CPTs based on the rest of the data
• We then estimate P(Burglary) for November 27 from those CPTs
• Now we recompute the CPTs as if that estimated value had been observed
• Repeat until convergence!
[Figure: the Earthquake/Burglary → Alarm network]