
Page 1:

Quiz 3:

Mean: 9.2, Median: 9.75

Go over problem 1

Page 2:

Go over Adaboost examples

Page 3:

Fix to C4.5 data formatting problem?

Page 4:

Quiz 4

Page 5:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Page 6:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Humidity: 25, 38, 50, 80, 93, 98, 98, 99

Page 7:

Alternative simple (but effective) discretization method

(Yang & Webb, 2001)

Let n = number of training examples. For each attribute Ai, create ≈ √n bins. Sort the values of Ai in ascending order, and put ≈ √n of them in each bin.

Don’t need add-one smoothing of probabilities.

This gives a good balance between discretization bias and variance.

Humidity: 25, 38, 50, 80, 93, 98, 98, 99

√8 ≈ 3 bins, 3 items per bin

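A minimal sketch of this equal-frequency scheme in Python; the function name and the use of NumPy are illustrative, not from Yang & Webb:

```python
import numpy as np

def sqrt_n_discretize(values):
    """Sort the attribute values and split them into ~sqrt(n) equal-frequency bins."""
    values = np.sort(np.asarray(values, dtype=float))
    n = len(values)
    n_bins = max(1, round(np.sqrt(n)))      # ~sqrt(n) bins of ~sqrt(n) values each
    return np.array_split(values, n_bins)

humidity = [25, 38, 50, 80, 93, 98, 98, 99]  # the example above: n = 8, sqrt(8) ≈ 3
for i, b in enumerate(sqrt_n_discretize(humidity), 1):
    print(f"bin {i}: {b}")
# bin 1: [25. 38. 50.]   bin 2: [80. 93. 98.]   bin 3: [98. 99.]
```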

Page 9:

Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani)

The naive Bayes classifier is called “naive” because it assumes attributes are independent of one another given the class.

Page 10:

• This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

Page 11:

Experiments

• Compare five classification methods on 30 data sets from the UCI ML database.

SBC = Simple Bayesian Classifier

Default = “Choose class with most representatives in data”

C4.5 = Quinlan’s decision tree induction system

PEBLS = An instance-based learning system

CN2 = A rule-induction system

Page 12:

• For SBC, numeric values were discretized into ten equal-length intervals.

Page 13:
Page 14:

[Results table (values not shown); its rows report:]

• Number of domains in which SBC was more accurate versus less accurate than the corresponding classifier

• Same as the first row, but counting only differences significant at the 95% confidence level

• Average rank over all domains (1 is best in each domain)

Page 15:

Measuring Attribute Dependence

They used a simple, pairwise mutual information measure:

For attributes Am and An, dependence is defined as

D(Am, An | C) = Entropy(Am | C) + Entropy(An | C) − Entropy(AmAn | C)

where AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An.

Note: If Am and An are independent given the class C, then D(Am, An | C) = 0.
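A minimal sketch of how this measure could be estimated from a sample (empirical conditional entropies, in bits); the helper names here are illustrative, not from the paper:

```python
import math
from collections import Counter, defaultdict

def cond_entropy(xs, cs):
    """Empirical H(X | C) = sum over classes c of P(c) * H(X | C = c), in bits."""
    n = len(cs)
    by_class = defaultdict(list)
    for x, c in zip(xs, cs):
        by_class[c].append(x)
    h = 0.0
    for vals in by_class.values():
        p_c = len(vals) / n
        counts = Counter(vals)
        h_c = -sum((m / len(vals)) * math.log2(m / len(vals)) for m in counts.values())
        h += p_c * h_c
    return h

def dependence(am, an, cs):
    """D(Am, An | C) = Entropy(Am|C) + Entropy(An|C) - Entropy(AmAn|C)."""
    derived = list(zip(am, an))   # the "derived attribute" AmAn
    return cond_entropy(am, cs) + cond_entropy(an, cs) - cond_entropy(derived, cs)

# Toy check: B is an exact copy of A, so the two attributes are completely dependent.
A = [0, 0, 1, 1, 0, 1, 1, 0]
B = A[:]
C = ['+', '+', '+', '-', '-', '-', '+', '-']
print(dependence(A, B, C))   # > 0, since A and B are dependent given the class
```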

Page 16:

Results:

(1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes.

(2) No correlation between degree of attribute dependence and SBC’s rank.

But why????

Page 17:

An Example

• Let the classes be {+, −} and the attributes be A, B, and C.

• Let P(+) = P(−) = 1/2.

• Suppose A and C are completely independent, and A and B are completely dependent (e.g., A = B).

• Optimal classification procedure:

cMAP = argmax over cj ∈ {+, −} of P(A, B, C | cj) P(cj)

     = argmax over cj ∈ {+, −} of P(A | cj) P(C | cj)

(since B = A, A and C are independent, and P(+) = P(−) = 1/2)

Page 18:

• This leads to the following Optimal Classifier conditions:

If P(A|+) P(C|+) > P(A|−) P(C|−)

then class = +

else class = −

• SBC conditions (the SBC counts A twice, since B = A):

If P(A|+)² P(C|+) > P(A|−)² P(C|−)

then class = +

else class = −

Page 19:

[Figure: decision boundaries of the Optimal classifier and the SBC, plotted in terms of p = P(+ | A) and q = P(+ | C).]

In the paper, the authors use Bayes Theorem to rewrite these conditions, and plot the “decision boundaries” for the optimal classifier and for the SBC.

Page 20:

Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the problem space! But why?
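These two rules can be compared numerically. Rewriting them with Bayes’ theorem under the uniform class prior (as the paper does) gives: the optimal rule predicts + when pq > (1 − p)(1 − q), and the SBC predicts + when p²q > (1 − p)²(1 − q), with p = P(+ | A) and q = P(+ | C). A small sketch of the sweep (grid resolution is arbitrary):

```python
import numpy as np

# Optimal rule:  predict + iff P(A|+) P(C|+)   > P(A|-) P(C|-)
# SBC rule:      predict + iff P(A|+)^2 P(C|+) > P(A|-)^2 P(C|-)
# With P(+) = P(-) = 1/2 and p = P(+|A), q = P(+|C), Bayes' theorem turns these
# into the conditions on (p, q) used below.

p, q = np.meshgrid(np.linspace(0.01, 0.99, 199), np.linspace(0.01, 0.99, 199))

optimal_plus = p * q > (1 - p) * (1 - q)
sbc_plus = p**2 * q > (1 - p)**2 * (1 - q)

agreement = (optimal_plus == sbc_plus).mean()
print(f"SBC matches the optimal classification on {agreement:.1%} of the (p, q) space")
```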

Page 21:

• Explanation:

Suppose C = {+,−} are the possible classes. Let x be a new example with attributes <a1, a2, ..., an>.

What the naive Bayes classifier does is calculate two quantities,

P(+ | x) ∝ P(+) ∏i P(ai | +)

P(− | x) ∝ P(−) ∏i P(ai | −)

and return the class that has the maximum probability given x.

Page 22:

• The probability calculations are correct only if the independence assumption is correct.

• However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct!

• The latter covers a lot more cases than the former.

• Thus, the SBC is effective in many cases in which the independence assumption does not hold.

Page 23:

More on Bias and Variance

Page 24:

Bias

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 25:

Variance

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 26:

Noise

(Figure from http://eecs.oregonstate.edu/~tgd/talks/BV.ppt)

Page 27:

Sources of Bias and Variance

• Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data

• Variance arises when the classifier overfits the data

• There is often a tradeoff between bias and variance

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 28:

Bias-Variance Tradeoff

As a general rule,

the more biased a learning machine,

the less variance it has,

and the more variance it has,

the less biased it is.


From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

Page 29:

From: http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=e06

Page 30:

Bias-Variance Tradeoff

As a general rule,

the more biased a learning machine,

the less variance it has,

and the more variance it has,

the less biased it is.


From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

Why?

Page 31:

SVM Bias and Variance

• Bias-Variance tradeoff controlled by s

• Biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary!

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 32:

Effect of Boosting

• In the early iterations, boosting is primarily a bias-reducing method

• In later iterations, it appears to be primarily a variance-reducing method

From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

Page 33:

Bayesian Networks

Reading: S. Wooldridge, Bayesian belief networks

(linked from class website)

Page 34:

A patient comes into a doctor’s office with a fever and a bad cough.

Hypothesis space H:

h1: patient has flu

h2: patient does not have flu

Data D:

coughing = true, fever = true, smokes = true

Page 35:

Naive Bayes

[Naive Bayes model over the nodes smokes, flu, cough, fever: the cause (flu) points to its effects.]

P(flu | cough, fever) ≈ P(flu) P(cough | flu) P(fever | flu)

Page 36:

In principle, the full joint distribution can be used to answer any question about probabilities of these combined parameters.

However, size of full joint distribution scales exponentially with number of parameters so is expensive to store and to compute with.

Full joint probability distribution (sum of all 16 entries is 1):

smokes:
              cough               ¬cough
         fever    ¬fever     fever    ¬fever
  flu     p1        p2         p3       p4
  ¬flu    p5        p6         p7       p8

¬smokes:
              cough               ¬cough
         fever    ¬fever     fever    ¬fever
  flu     p9        p10        p11      p12
  ¬flu    p13       p14        p15      p16

Page 37:

Bayesian networks

• Idea is to represent dependencies (or causal relations) for all the variables so that space and computation-time requirements are minimized.

[Network: smokes and flu are parents of cough; flu is the parent of fever.]

“Graphical Models”

Page 38:

Conditional probability tables for each node:

P(flu):
  flu = true     0.01
  flu = false    0.99

P(smoke):
  smoke = true     0.2
  smoke = false    0.8

P(fever | flu):
  flu      fever = true    fever = false
  true         0.9              0.1
  false        0.2              0.8

P(cough | flu, smoke):
  flu     smoke     cough = true    cough = false
  true    true          0.95            0.05
  true    false         0.8             0.2
  false   true          0.6             0.4
  false   false         0.05            0.95

Page 39:

Semantics of Bayesian networks

• If network is correct, can calculate full joint probability distribution from network.

where parents(Xi) denotes specific values of parents of Xi.

P((X1 = x1) ∧ ... ∧ (Xn = xn)) = ∏i=1..n P(Xi = xi | parents(Xi))

Page 40:

Example

• Calculate

P(cough = t ∧ fever = f ∧ flu = f ∧ smoke = f)

= ∏i=1..n P(Xi = xi | parents(Xi))

= P(cough = t | flu = f ∧ smoke = f) × P(fever = f | flu = f) × P(flu = f) × P(smoke = f)

= 0.05 × 0.8 × 0.99 × 0.8

≈ 0.032
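A minimal sketch of this network and the calculation above in Python; the dictionary-based representation and the function name are illustrative, while the CPT values are the ones from the tables on the previous slide:

```python
P_flu   = {True: 0.01, False: 0.99}
P_smoke = {True: 0.2,  False: 0.8}
P_fever_given_flu = {True:  {True: 0.9, False: 0.1},    # P(fever | flu)
                     False: {True: 0.2, False: 0.8}}
P_cough_given_flu_smoke = {(True, True):   {True: 0.95, False: 0.05},
                           (True, False):  {True: 0.8,  False: 0.2},
                           (False, True):  {True: 0.6,  False: 0.4},
                           (False, False): {True: 0.05, False: 0.95}}

def joint(cough, fever, flu, smoke):
    """Joint probability as the product of each node given its parents."""
    return (P_cough_given_flu_smoke[(flu, smoke)][cough]
            * P_fever_given_flu[flu][fever]
            * P_flu[flu]
            * P_smoke[smoke])

# Reproduces the worked example: P(cough=t, fever=f, flu=f, smoke=f) ≈ 0.032
print(joint(cough=True, fever=False, flu=False, smoke=False))   # 0.03168
```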

Page 41:

Another (famous, though weird) Example

Rain → Wet grass

P(W | R) = 0.9
P(W | ¬R) = 0.2
P(R) = 0.4

Question: If you observe that the grass is wet, what is the probability it rained?

Page 42:

P(R | W) = P(W | R) P(R) / P(W)    (Bayes rule)

= P(W | R) P(R) / [ P(W | R) P(R) + P(W | ¬R) P(¬R) ]

= (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6)

= 0.36 / 0.48

= 0.75
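A quick check of this calculation in Python (values from the slide):

```python
P_R = 0.4
P_W_given_R, P_W_given_notR = 0.9, 0.2

P_W = P_W_given_R * P_R + P_W_given_notR * (1 - P_R)   # total probability
P_R_given_W = P_W_given_R * P_R / P_W                   # Bayes rule
print(P_R_given_W)   # 0.75
```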

Page 43:

Sprinkler → Wet grass ← Rain

P(W | R, S) = 0.95
P(W | R, ¬S) = 0.90
P(W | ¬R, S) = 0.90
P(W | ¬R, ¬S) = 0.10

P(S) = 0.2,  P(R) = 0.4

Question: If you observe that the sprinkler is on, what is the probability that the grass is wet? (Predictive inference.)

Page 44:

P(W | S) = P(W | R, S) P(R | S) + P(W | ¬R, S) P(¬R | S)

= P(W | R, S) P(R) + P(W | ¬R, S) P(¬R)    (since R and S are independent of each other)

So: P(W | S) = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

Page 45:

Question: If you observe that the grass is wet, what is the probability that the sprinkler is on? (Diagnostic inference.)

P(S | W) = P(W | S) P(S) / P(W)    (Bayes rule)

= (0.92 × 0.2) / [ P(W | R, S) P(R, S) + P(W | ¬R, S) P(¬R, S) + P(W | R, ¬S) P(R, ¬S) + P(W | ¬R, ¬S) P(¬R, ¬S) ]

= (0.92 × 0.2) / (0.95 × 0.4 × 0.2 + 0.9 × 0.6 × 0.2 + 0.9 × 0.4 × 0.8 + 0.1 × 0.6 × 0.8)

= 0.184 / 0.52

≈ 0.35

Note that P(S) = 0.2. So, knowing that the grass is wet increased the probability that the sprinkler is on.

Page 46:

Now assume the grass is wet and it rained. What is the probability that the sprinkler was on?

P(S | R, W) = P(W | R, S) P(S | R) / P(W | R)    (Bayes rule)

= P(W | R, S) P(S) / P(W | R)    (since S and R are independent)

= (0.95 × 0.2) / (0.95 × 0.2 + 0.9 × 0.8) = 0.19 / 0.91 ≈ 0.21

Knowing that it rained decreases the probability that the sprinkler was on, given that the grass is wet.
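All of these sprinkler/rain queries can also be reproduced by brute-force enumeration over the joint distribution. A minimal Python sketch (function and variable names are illustrative; the CPT values are the ones above):

```python
from itertools import product

P_R, P_S = 0.4, 0.2
P_W = {(True, True): 0.95, (True, False): 0.90,    # P(W=true | R, S)
       (False, True): 0.90, (False, False): 0.10}

def joint(r, s, w):
    """P(R=r, S=s, W=w): R and S are independent root nodes, W depends on both."""
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

def prob(query, given=lambda r, s, w: True):
    """P(query | given) by enumerating all assignments of (R, S, W)."""
    num = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3)
              if given(r, s, w) and query(r, s, w))
    den = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3)
              if given(r, s, w))
    return num / den

print(prob(lambda r, s, w: w, given=lambda r, s, w: s))        # P(W | S)    = 0.92
print(prob(lambda r, s, w: s, given=lambda r, s, w: w))        # P(S | W)    ≈ 0.35
print(prob(lambda r, s, w: s, given=lambda r, s, w: r and w))  # P(S | R, W) ≈ 0.21
```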

Page 47:

Sprinkler → Wet grass ← Rain, with Cloudy as a parent of both Sprinkler and Rain.

P(C) = 0.5

P(R | C) = 0.8,  P(R | ¬C) = 0.1
P(S | C) = 0.1,  P(S | ¬C) = 0.5

P(W | R, S) = 0.95
P(W | R, ¬S) = 0.90
P(W | ¬R, S) = 0.90
P(W | ¬R, ¬S) = 0.10

Question: Given that it is cloudy, what is the probability that the grass is wet?

Page 48:

P(W | C) = P(W | R, S, C) P(R, S | C) + P(W | R, ¬S, C) P(R, ¬S | C) + P(W | ¬R, S, C) P(¬R, S | C) + P(W | ¬R, ¬S, C) P(¬R, ¬S | C)

= P(W | R, S) P(R | C) P(S | C) + P(W | R, ¬S) P(R | C) P(¬S | C) + P(W | ¬R, S) P(¬R | C) P(S | C) + P(W | ¬R, ¬S) P(¬R | C) P(¬S | C)

(W is independent of C given R and S, and R and S are independent of each other given C.)
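The same summation can be written out in code. A short sketch; the P(R | C) and P(S | C) values used here are the ones listed on the previous slide, so treat the printed number as illustrative:

```python
P_R_given_C = {True: 0.8, False: 0.1}   # P(Rain = true | Cloudy)
P_S_given_C = {True: 0.1, False: 0.5}   # P(Sprinkler = true | Cloudy)
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(Wet = true | Rain, Sprinkler)

def p_wet_given_cloudy(c=True):
    """Sum over the hidden Rain/Sprinkler values, exactly as in the derivation above."""
    total = 0.0
    for r in (True, False):
        for s in (True, False):
            pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
            ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
            total += P_W[(r, s)] * pr * ps
    return total

print(p_wet_given_cloudy(True))   # P(W | C) with the values above
```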

Page 49:

In general...

• If network is correct, can calculate full joint probability distribution from network.

where parents(Xi) denotes specific values of parents of Xi.

But need efficient algorithms to do this (e.g., “belief propagation”, “Markov Chain Monte Carlo”).

P(X1, ..., Xd) = ∏i=1..d P(Xi | parents(Xi))

Page 50:

Complexity of Bayesian Networks

For n random Boolean variables:

• Full joint probability distribution: 2ⁿ entries

• Bayesian network with at most k parents per node:

– Each conditional probability table: at most 2ᵏ entries
– Entire network: n · 2ᵏ entries
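For concreteness, a quick comparison with example values (n and k chosen arbitrarily here):

```python
# Illustrative size comparison for Boolean variables.
n, k = 30, 3
print(2**n)       # full joint distribution: 1,073,741,824 entries
print(n * 2**k)   # network with at most k parents per node: 240 entries
```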

Page 51:

What are the advantages of Bayesian networks?

• Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables.

• Represents “beliefs and knowledge” about a particular class of situations.

• Efficient (?) (approximate) inference algorithms

• Efficient, effective learning algorithms

Page 52:

Issues in Bayesian Networks

• Building / learning network topology

• Assigning / learning conditional probability tables

• Approximate inference via sampling

Page 53:

Real-World Example: The Lumière Project at Microsoft Research

• Bayesian network approach to answering user queries about Microsoft Office.

• “At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently.”

• “As an example, users working with the Excel spreadsheet might have required assistance with formatting ‘a graph’. Unfortunately, Excel has no knowledge about the common term, ‘graph,’ and only considered in its keyword indexing the term ‘chart’.”

Page 54:
Page 55:

• Networks were developed by experts from user modeling studies.

Page 56:
Page 57:

• An offspring of the project was the Office Assistant in Office 97, otherwise known as “Clippy”.

http://www.youtube.com/watch?v=bt-JXQS0zYc