
Page 1:

Quiz 3:

Mean: 9.2, Median: 9.75

Go over problem 1

Page 2:

Go over Adaboost examples

Page 3:

Fix to C4.5 data formatting problem?

Page 4:

Quiz 4

Page 5:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Page 6:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Humidity: 25, 38, 50, 80, 93, 98, 98, 99

Page 7:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Humidity: 25, 38, 50, 80, 93, 98, 98, 99

√8 ≈ 3 bins, 3 items per bin

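A minimal sketch of this equal-frequency scheme in Python; the function name and the use of NumPy are illustrative, not from Yang & Webb:

```python
import numpy as np

def sqrt_n_discretize(values):
    """Sort the attribute values and split them into ~sqrt(n) equal-frequency bins."""
    values = np.sort(np.asarray(values, dtype=float))
    n = len(values)
    n_bins = max(1, round(np.sqrt(n)))      # ~sqrt(n) bins of ~sqrt(n) values each
    return np.array_split(values, n_bins)

humidity = [25, 38, 50, 80, 93, 98, 98, 99]  # the example above: n = 8, sqrt(8) ≈ 3
for i, b in enumerate(sqrt_n_discretize(humidity), 1):
    print(f"bin {i}: {b}")
# bin 1: [25. 38. 50.]   bin 2: [80. 93. 98.]   bin 3: [98. 99.]
```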

Page 9:

Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani)

The naive Bayes classifier is called “naive” because it assumes attributes are independent of one another given the class.

Page 10:

• This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

Page 11:

Experiments

• Compare five classification methods on 30 data sets from the UCI ML database.

SBC = Simple Bayesian Classifier

Default = “Choose class with most representatives in data”

C4.5 = Quinlan’s decision tree induction system

PEBLS = An instance-based learning system

CN2 = A rule-induction system

Page 12:

• For SBC, numeric values were discretized into ten equal-length intervals.

Page 13:
Page 14:

[Results table (values not shown); its rows report:]

• Number of domains in which SBC was more accurate versus less accurate than the corresponding classifier

• Same as the first row, but counting only differences significant at the 95% confidence level

• Average rank over all domains (1 is best in each domain)

Page 15:

Measuring Attribute Dependence

They used a simple, pairwise mutual information measure:

For attributes Am and An, dependence is defined as

D(Am, An | C) = Entropy(Am | C) + Entropy(An | C) − Entropy(AmAn | C)

where AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An.

Note: If Am and An are independent given the class C, then D(Am, An | C) = 0.
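A minimal sketch of how this measure could be estimated from a sample (empirical conditional entropies, in bits); the helper names here are illustrative, not from the paper:

```python
import math
from collections import Counter, defaultdict

def cond_entropy(xs, cs):
    """Empirical H(X | C) = sum over classes c of P(c) * H(X | C = c), in bits."""
    n = len(cs)
    by_class = defaultdict(list)
    for x, c in zip(xs, cs):
        by_class[c].append(x)
    h = 0.0
    for vals in by_class.values():
        p_c = len(vals) / n
        counts = Counter(vals)
        h_c = -sum((m / len(vals)) * math.log2(m / len(vals)) for m in counts.values())
        h += p_c * h_c
    return h

def dependence(am, an, cs):
    """D(Am, An | C) = Entropy(Am|C) + Entropy(An|C) - Entropy(AmAn|C)."""
    derived = list(zip(am, an))   # the "derived attribute" AmAn
    return cond_entropy(am, cs) + cond_entropy(an, cs) - cond_entropy(derived, cs)

# Toy check: B is an exact copy of A, so the two attributes are completely dependent.
A = [0, 0, 1, 1, 0, 1, 1, 0]
B = A[:]
C = ['+', '+', '+', '-', '-', '-', '+', '-']
print(dependence(A, B, C))   # > 0, since A and B are dependent given the class
```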

Page 16:

Results:

(1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes.

(2) No correlation between degree of attribute dependence and SBC’s rank.

But why????

Page 17:

An Example

• Let the classes be {+, −} and the attributes be A, B, and C.

• Let P(+) = P(−) = 1/2.

• Suppose A and C are completely independent, and A and B are completely dependent (e.g., A = B).

• Optimal classification procedure:

cMAP = argmax over cj ∈ {+, −} of P(A, B, C | cj) P(cj)

     = argmax over cj ∈ {+, −} of P(A | cj) P(C | cj)

(since B = A, A and C are independent, and P(+) = P(−) = 1/2)

Page 18:

• This leads to the following Optimal Classifier conditions:

If P(A|+) P(C|+) > P(A|−) P(C|−)

then class = +

else class = −

• SBC conditions (the SBC counts A twice, since B = A):

If P(A|+)² P(C|+) > P(A|−)² P(C|−)

then class = +

else class = −

Page 19:

[Figure: decision boundaries of the Optimal classifier and the SBC, plotted in terms of p = P(+ | A) and q = P(+ | C).]

In the paper, the authors use Bayes Theorem to rewrite these conditions, and plot the “decision boundaries” for the optimal classifier and for the SBC.

Page 20:

Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the problem space! But why?
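These two rules can be compared numerically. Rewriting them with Bayes’ theorem under the uniform class prior (as the paper does) gives: the optimal rule predicts + when pq > (1 − p)(1 − q), and the SBC predicts + when p²q > (1 − p)²(1 − q), with p = P(+ | A) and q = P(+ | C). A small sketch of the sweep (grid resolution is arbitrary):

```python
import numpy as np

# Optimal rule:  predict + iff P(A|+) P(C|+)   > P(A|-) P(C|-)
# SBC rule:      predict + iff P(A|+)^2 P(C|+) > P(A|-)^2 P(C|-)
# With P(+) = P(-) = 1/2 and p = P(+|A), q = P(+|C), Bayes' theorem turns these
# into the conditions on (p, q) used below.

p, q = np.meshgrid(np.linspace(0.01, 0.99, 199), np.linspace(0.01, 0.99, 199))

optimal_plus = p * q > (1 - p) * (1 - q)
sbc_plus = p**2 * q > (1 - p)**2 * (1 - q)

agreement = (optimal_plus == sbc_plus).mean()
print(f"SBC matches the optimal classification on {agreement:.1%} of the (p, q) space")
```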

Page 21:

• Explanation:

Suppose C = {+,−} are the possible classes. Let x be a new example with attributes <a1, a2, ..., an>.

What the naive Bayes classifier does is calculate two quantities,

P(+ | x) ∝ P(+) ∏i P(ai | +)

P(− | x) ∝ P(−) ∏i P(ai | −)

and return the class that has the maximum probability given x.

Page 22:

• The probability calculations are correct only if the independence assumption is correct.

• However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct!

• The latter covers a lot more cases than the former.

• Thus, the SBC is effective in many cases in which the independence assumption does not hold.

Page 23:

More on Bias and Variance

Page 24:

Bias

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 25:

Variance

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 26:

Noise

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 27:

Sources of Bias and Variance

• Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data

• Variance arises when the classifier overfits the data

• There is often a tradeoff between bias and variance

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 28:

Bias-Variance Tradeoff

As a general rule,

the more biased a learning machine,

the less variance it has,

and the more variance it has,

the less biased it is.


From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

Page 29:

From: http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=e06

Page 30:

Bias-Variance Tradeoff

As a general rule,

the more biased a learning machine,

the less variance it has,

and the more variance it has,

the less biased it is.


From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

Why?

Page 31:

SVM Bias and Variance

• Bias-Variance tradeoff controlled by s

• Biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary!

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 32:

Effect of Boosting

• In the early iterations, boosting is primarily a bias-reducing method

• In later iterations, it appears to be primarily a variance-reducing method

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 33:

Bayesian Networks

Reading: S. Wooldridge, Bayesian belief networks

(linked from class website)

Page 34:

A patient comes into a doctor’s office with a fever and a bad cough.

Hypothesis space H:

h1: patient has flu

h2: patient does not have flu

Data D:

coughing = true, fever = true, smokes = true

Page 35:

Naive Bayes

[Naive Bayes model over the nodes smokes, flu, cough, fever: the cause (flu) points to its effects.]

P(flu | cough, fever) ≈ P(flu) P(cough | flu) P(fever | flu)

Page 36:

In principle, the full joint distribution can be used to answer any question about probabilities of these combined parameters.

However, size of full joint distribution scales exponentially with number of parameters so is expensive to store and to compute with.

Full joint probability distribution (sum of all 16 entries is 1):

smokes:
              cough               ¬cough
         fever    ¬fever     fever    ¬fever
  flu     p1        p2         p3       p4
  ¬flu    p5        p6         p7       p8

¬smokes:
              cough               ¬cough
         fever    ¬fever     fever    ¬fever
  flu     p9        p10        p11      p12
  ¬flu    p13       p14        p15      p16

Page 37:

Bayesian networks

• Idea is to represent dependencies (or causal relations) for all the variables so that space and computation-time requirements are minimized.

[Network: smokes and flu are parents of cough; flu is the parent of fever.]

“Graphical Models”

Page 38:

Conditional probability tables for each node:

P(flu):
  flu = true     0.01
  flu = false    0.99

P(smoke):
  smoke = true     0.2
  smoke = false    0.8

P(fever | flu):
  flu      fever = true    fever = false
  true         0.9              0.1
  false        0.2              0.8

P(cough | flu, smoke):
  flu     smoke     cough = true    cough = false
  true    true          0.95            0.05
  true    false         0.8             0.2
  false   true          0.6             0.4
  false   false         0.05            0.95

Page 39:

Semantics of Bayesian networks

• If network is correct, can calculate full joint probability distribution from network.

where parents(Xi) denotes specific values of parents of Xi.

P((X1 = x1) ∧ ... ∧ (Xn = xn)) = ∏i=1..n P(Xi = xi | parents(Xi))

Page 40:

Example

• Calculate

P(cough = t ∧ fever = f ∧ flu = f ∧ smoke = f)

= ∏i=1..n P(Xi = xi | parents(Xi))

= P(cough = t | flu = f ∧ smoke = f) × P(fever = f | flu = f) × P(flu = f) × P(smoke = f)

= 0.05 × 0.8 × 0.99 × 0.8

≈ 0.032
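A minimal sketch of this network and the calculation above in Python; the dictionary-based representation and the function name are illustrative, while the CPT values are the ones from the tables on the previous slide:

```python
P_flu   = {True: 0.01, False: 0.99}
P_smoke = {True: 0.2,  False: 0.8}
P_fever_given_flu = {True:  {True: 0.9, False: 0.1},    # P(fever | flu)
                     False: {True: 0.2, False: 0.8}}
P_cough_given_flu_smoke = {(True, True):   {True: 0.95, False: 0.05},
                           (True, False):  {True: 0.8,  False: 0.2},
                           (False, True):  {True: 0.6,  False: 0.4},
                           (False, False): {True: 0.05, False: 0.95}}

def joint(cough, fever, flu, smoke):
    """Joint probability as the product of each node given its parents."""
    return (P_cough_given_flu_smoke[(flu, smoke)][cough]
            * P_fever_given_flu[flu][fever]
            * P_flu[flu]
            * P_smoke[smoke])

# Reproduces the worked example: P(cough=t, fever=f, flu=f, smoke=f) ≈ 0.032
print(joint(cough=True, fever=False, flu=False, smoke=False))   # 0.03168
```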

Page 41:

Another (famous, though weird) Example

Rain → Wet grass

P(W | R) = 0.9
P(W | ¬R) = 0.2
P(R) = 0.4

Question: If you observe that the grass is wet, what is the probability it rained?

Page 42:

P(R | W) = P(W | R) P(R) / P(W)    (Bayes rule)

= P(W | R) P(R) / [ P(W | R) P(R) + P(W | ¬R) P(¬R) ]

= (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6)

= 0.36 / 0.48

= 0.75
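A quick check of this calculation in Python (values from the slide):

```python
P_R = 0.4
P_W_given_R, P_W_given_notR = 0.9, 0.2

P_W = P_W_given_R * P_R + P_W_given_notR * (1 - P_R)   # total probability
P_R_given_W = P_W_given_R * P_R / P_W                   # Bayes rule
print(P_R_given_W)   # 0.75
```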

Page 43:

Sprinkler → Wet grass ← Rain

P(W | R, S) = 0.95
P(W | R, ¬S) = 0.90
P(W | ¬R, S) = 0.90
P(W | ¬R, ¬S) = 0.10

P(S) = 0.2,  P(R) = 0.4

Question: If you observe that the sprinkler is on, what is the probability that the grass is wet? (Predictive inference.)

Page 44:

P(W | S) = P(W | R, S) P(R | S) + P(W | ¬R, S) P(¬R | S)

= P(W | R, S) P(R) + P(W | ¬R, S) P(¬R)    (since R and S are independent of each other)

So: P(W | S) = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

Page 45:

Question: If you observe that the grass is wet, what is the probability that the sprinkler is on? (Diagnostic inference.)

P(S | W) = P(W | S) P(S) / P(W)    (Bayes rule)

= (0.92 × 0.2) / [ P(W | R, S) P(R, S) + P(W | ¬R, S) P(¬R, S) + P(W | R, ¬S) P(R, ¬S) + P(W | ¬R, ¬S) P(¬R, ¬S) ]

= (0.92 × 0.2) / (0.95 × 0.4 × 0.2 + 0.9 × 0.6 × 0.2 + 0.9 × 0.4 × 0.8 + 0.1 × 0.6 × 0.8)

= 0.184 / 0.52

≈ 0.35

Note that P(S) = 0.2. So, knowing that the grass is wet increased the probability that the sprinkler is on.

Page 46:

Now assume the grass is wet and it rained. What is the probability that the sprinkler was on?

P(S | R, W) = P(W | R, S) P(S | R) / P(W | R)    (Bayes rule)

= P(W | R, S) P(S) / P(W | R)    (since S and R are independent)

= (0.95 × 0.2) / (0.95 × 0.2 + 0.9 × 0.8) = 0.19 / 0.91 ≈ 0.21

Knowing that it rained decreases the probability that the sprinkler was on, given that the grass is wet.
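All of these sprinkler/rain queries can also be reproduced by brute-force enumeration over the joint distribution. A minimal Python sketch (function and variable names are illustrative; the CPT values are the ones above):

```python
from itertools import product

P_R, P_S = 0.4, 0.2
P_W = {(True, True): 0.95, (True, False): 0.90,    # P(W=true | R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(r, s, w):
    """P(R=r, S=s, W=w): R and S are independent root nodes, W depends on both."""
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

def prob(query, given=lambda r, s, w: True):
    """P(query | given) by enumerating all assignments of (R, S, W)."""
    num = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3)
              if given(r, s, w) and query(r, s, w))
    den = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3)
              if given(r, s, w))
    return num / den

print(prob(lambda r, s, w: w, given=lambda r, s, w: s))        # P(W | S)    = 0.92
print(prob(lambda r, s, w: s, given=lambda r, s, w: w))        # P(S | W)    ≈ 0.35
print(prob(lambda r, s, w: s, given=lambda r, s, w: r and w))  # P(S | R, W) ≈ 0.21
```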

Page 47:

Sprinkler → Wet grass ← Rain, with Cloudy as a parent of both Sprinkler and Rain.

P(C) = 0.5

P(R | C) = 0.8,  P(R | ¬C) = 0.1
P(S | C) = 0.1,  P(S | ¬C) = 0.5

P(W | R, S) = 0.95
P(W | R, ¬S) = 0.90
P(W | ¬R, S) = 0.90
P(W | ¬R, ¬S) = 0.10

Question: Given that it is cloudy, what is the probability that the grass is wet?

Page 48:

P(W | C) = P(W | R, S, C) P(R, S | C) + P(W | R, ¬S, C) P(R, ¬S | C) + P(W | ¬R, S, C) P(¬R, S | C) + P(W | ¬R, ¬S, C) P(¬R, ¬S | C)

= P(W | R, S) P(R | C) P(S | C) + P(W | R, ¬S) P(R | C) P(¬S | C) + P(W | ¬R, S) P(¬R | C) P(S | C) + P(W | ¬R, ¬S) P(¬R | C) P(¬S | C)

(W is independent of C given R and S, and R and S are independent of each other given C.)
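The same summation can be written out in code. A short sketch; the P(R | C) and P(S | C) values used here are the ones listed on the previous slide, so treat the printed number as illustrative:

```python
P_R_given_C = {True: 0.8, False: 0.1}   # P(Rain = true | Cloudy)
P_S_given_C = {True: 0.1, False: 0.5}   # P(Sprinkler = true | Cloudy)
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(Wet = true | Rain, Sprinkler)

def p_wet_given_cloudy(c=True):
    """Sum over the hidden Rain/Sprinkler values, exactly as in the derivation above."""
    total = 0.0
    for r in (True, False):
        for s in (True, False):
            pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
            ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
            total += P_W[(r, s)] * pr * ps
    return total

print(p_wet_given_cloudy(True))   # P(W | C) with the values above
```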

Page 49:

In general...

• If network is correct, can calculate full joint probability distribution from network.

where parents(Xi) denotes specific values of parents of Xi.

But need efficient algorithms to do this (e.g., “belief propagation”, “Markov Chain Monte Carlo”).

P(X1, ..., Xd) = ∏i=1..d P(Xi | parents(Xi))

Page 50:

Complexity of Bayesian Networks

For n random Boolean variables:

• Full joint probability distribution: 2ⁿ entries

• Bayesian network with at most k parents per node:

– Each conditional probability table: at most 2ᵏ entries
– Entire network: n · 2ᵏ entries
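For concreteness, a quick comparison with example values (n and k chosen arbitrarily here):

```python
# Illustrative size comparison for Boolean variables.
n, k = 30, 3
print(2**n)       # full joint distribution: 1,073,741,824 entries
print(n * 2**k)   # network with at most k parents per node: 240 entries
```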

Page 51:

What are the advantages of Bayesian networks?

• Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables.

• Represents “beliefs and knowledge” about a particular class of situations.

• Efficient (?) (approximate) inference algorithms

• Efficient, effective learning algorithms

Page 52:

Issues in Bayesian Networks

• Building / learning network topology

• Assigning / learning conditional probability tables

• Approximate inference via sampling

Page 53:

Real-World Example: The Lumière Project at Microsoft Research

• Bayesian network approach to answering user queries about Microsoft Office.

• “At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently.”

• “As an example, users working with the Excel spreadsheet might have required assistance with formatting ‘a graph’. Unfortunately, Excel has no knowledge about the common term, ‘graph,’ and only considered in its keyword indexing the term ‘chart’.”

Page 54:
Page 55:

• Networks were developed by experts from user modeling studies.

Page 56:
Page 57:

• An offspring of the project was the Office Assistant in Office 97, otherwise known as “Clippy”.

http://www.youtube.com/watch?v=bt-JXQS0zYc