Machine Learning II: Ensembles & Cotraining / Representing Uncertainty (CSE 573)

Page 1: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Machine Learning II: Ensembles & Cotraining

CSE 573

Representing Uncertainty

Page 2: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 2

Logistics

• Reading: Ch. 13; Ch. 14 through 14.3

Page 3: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 3

DT Learning as Search

• Nodes: decision trees
• Operators: tree refinement (sprouting the tree)
• Initial node: the smallest tree possible (a single leaf)
• Heuristic: information gain
• Goal: the best tree possible (???)
• Type of search: hill climbing

Page 4: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Bias

• Restriction

• Preference

Page 5: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 5

Decision Tree Representation

[Decision tree figure: root Outlook with branches Sunny, Overcast, Rain; the Sunny branch tests Humidity (High -> No, Normal -> Yes); Overcast -> Yes; the Rain branch tests Wind (Strong -> No, Weak -> Yes)]

Good day for tennis?
Leaves = classification
Arcs = choice of value for the parent attribute

A decision tree is equivalent to logic in disjunctive normal form:
G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
(a small code sketch of this tree follows)
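To make the representation concrete, here is a minimal sketch of the tree above as nested Python dicts with a small classification helper. The attribute and value names follow the slide; the dict encoding and the classify function are illustrative choices, not course code.

```python
# Decision tree as nested dicts: an internal node maps one attribute to a
# dict of {value: subtree}; a leaf is just the classification string.
tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow the arcs chosen by the example's attribute values until a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

print(classify(tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes
```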

Page 6: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 6

Overfitting

[Plot: accuracy (y-axis, roughly 0.6 to 0.9) vs. model complexity (x-axis, e.g. number of nodes in the decision tree); accuracy on the training data keeps climbing while accuracy on the test data peaks and then falls off]

Page 7: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 7

Machine Learning Outline

•Supervised Learning Review

•Ensembles of Classifiers: Bagging, Cross-validated committees, Boosting, Stacking

•Co-training

Page 8: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 8

Voting

Page 9: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 9

Ensembles of Classifiers

• Assume errors are independent (suppose each classifier has 30% error) and that we take a majority vote
• Probability that the majority is wrong = area under the binomial distribution for more than half the classifiers being in error
• For an ensemble of 21 classifiers with individual error 0.3, the area under the curve for 11 or more being wrong is 0.026
• An order of magnitude improvement! (see the computation sketch below)

[Plot: probability (y-axis, ticks at 0.1, 0.2) vs. number of classifiers in error, for an ensemble of 21 classifiers]
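A quick check of the 0.026 figure above; a minimal sketch assuming 21 classifiers that err independently with probability 0.3 each:

```python
from math import comb

def majority_error(n: int, p: float) -> float:
    """P(more than half of n independent classifiers are wrong), each with error p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(round(majority_error(21, 0.3), 3))   # 0.026 -- the area under the binomial tail
```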

Page 10: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 10

Constructing Ensembles: Cross-Validated Committees

• Partition the examples into k disjoint equivalence classes
• Now create k training sets (see the sketch below)
  - Each set is the union of all equivalence classes except one
  - So each set has (k-1)/k of the original training data
• Now train a classifier on each set

[Figure: k copies of the data, each with a different partition held out]
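A minimal sketch of the construction just described, assuming a plain Python list of examples (the fold-splitting scheme is one simple choice):

```python
def cv_committee_sets(examples, k):
    """k training sets, each the union of all folds except one (holdout)."""
    folds = [examples[i::k] for i in range(k)]          # k disjoint partitions
    return [
        [x for j, fold in enumerate(folds) if j != i for x in fold]
        for i in range(k)
    ]

data = list(range(10))
for train_set in cv_committee_sets(data, k=5):
    print(train_set)   # each set holds (k-1)/k = 80% of the original data
# One classifier would then be trained on each of the k sets.
```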

Page 11: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 11

Ensemble Construction II: Bagging

• Generate k sets of training examples
• For each set, draw m examples randomly (with replacement) from the original set of m examples
• Each training set therefore contains about 63.2% of the original examples (plus duplicates); see the sketch below
• Now train a classifier on each set
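A minimal bagging sketch under the same assumptions (plain Python, illustrative names). It also shows where the 63.2% figure comes from: the chance that a given example appears in a bootstrap sample of size m is 1 - (1 - 1/m)^m ≈ 1 - 1/e ≈ 0.632.

```python
import random

def bootstrap_sets(examples, k, seed=0):
    """k bootstrap samples: m draws with replacement from the m originals."""
    rng = random.Random(seed)
    m = len(examples)
    return [[rng.choice(examples) for _ in range(m)] for _ in range(k)]

data = list(range(1000))
samples = bootstrap_sets(data, k=5)
print([len(set(s)) / len(data) for s in samples])   # each ~0.632 of the originals
# One classifier would then be trained on each bootstrap sample.
```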

Page 12: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 12

Ensemble Creation III: Boosting

• Maintain a probability distribution over the set of training examples
• Create k sets of training data iteratively; on iteration i:
  - Draw m examples randomly (like bagging), but use the probability distribution to bias the selection
  - Train classifier number i on this training set
  - Test the partial ensemble (of i classifiers) on all training examples
  - Modify the distribution: increase P of each misclassified example
• This creates harder and harder learning problems...
• "Bagging with optimized choice of examples" (see the sketch below)
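A simplified sketch of the loop described above (the resampling flavor of boosting, not the full AdaBoost weight and vote formulas); the threshold-stump weak learner and the factor of 2 used to up-weight errors are illustrative choices only:

```python
import random

def train_stump(sample):
    """Toy weak learner: the threshold that best separates the sample."""
    best = None
    for t in {x for x, _ in sample}:
        err = sum((x > t) != (y == "pos") for x, y in sample)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def vote(ensemble, x):
    votes = ["pos" if x > t else "neg" for t in ensemble]
    return max(set(votes), key=votes.count)

def boost(examples, k, seed=0):
    rng = random.Random(seed)
    weights = [1.0] * len(examples)            # distribution over training examples
    ensemble = []
    for _ in range(k):
        # draw m examples, biased by the current distribution (like bagging, but biased)
        sample = rng.choices(examples, weights=weights, k=len(examples))
        ensemble.append(train_stump(sample))
        # test the partial ensemble and give misclassified examples more mass
        for i, (x, y) in enumerate(examples):
            if vote(ensemble, x) != y:
                weights[i] *= 2.0
    return ensemble

data = [(x, "pos" if x >= 6 else "neg") for x in range(12)]
print(boost(data, k=5))                        # list of learned thresholds
```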

Page 13: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 13

Ensemble Creation IV: Stacking

• Train several base learners
• Next train a meta-learner
  - It learns when the base learners are right / wrong
  - The meta-learner then arbitrates
• Train using cross-validated committees
  - Meta-learner inputs = base-learner predictions
  - Meta-learner training examples = the 'test set' from cross-validation
(see the sketch below)
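A minimal stacking sketch, assuming scikit-learn is available (the slides do not prescribe a library): two base learners feed their cross-validated predictions to a logistic-regression meta-learner, which learns when to trust each of them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,   # the meta-learner trains on out-of-fold base predictions
)
print(stack.fit(X, y).score(X, y))
```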

Page 14: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 14

Machine Learning Outline

•Supervised Learning Review

•Ensembles of Classifiers: Bagging, Cross-validated committees, Boosting, Stacking

•Co-training

Page 15: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 15

Co-Training Motivation

• Learning methods need labeled data
  - Lots of <x, f(x)> pairs
  - Hard to get... (who wants to label data?)
• But unlabeled data is usually plentiful...
  - Could we use it instead?

Page 16: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 16

Co-training

Suppose:

• We have a little labeled data plus lots of unlabeled data
• Each instance has two parts: x = [x1, x2], where x1 and x2 are conditionally independent given f(x)
• Each half can be used to classify the instance: there exist f1, f2 such that f1(x1) ~ f2(x2) ~ f(x)
• Both f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, with learning algorithms A1, A2

Page 17: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 17

Without Co-training: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: from a few labeled instances <[x1, x2], f()>, A1 learns f1 from the x1 half and A2 learns f2 from the x2 half; combining f1 and f2 into f' with an ensemble is bad, because the unlabeled instances are never used!]

Page 18: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 18

Co-training: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: A1 learns f1 from the x1 half of a few labeled instances <[x1, x2], f()>; f1 then labels the unlabeled instances, yielding lots of labeled instances <[x1, x2], f1(x1)>; A2 trains on the x2 half of these and outputs hypothesis f2]

Page 19: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 19

Observations

• We can apply A1 to generate as much training data as we want
  - If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!
• Thus there is no limit to the quality of the hypothesis A2 can make

Page 20: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 20

Co-training, iterated: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: the previous loop now runs in both directions; A1 learns f1 from x1 and labels lots of instances <[x1, x2], f1(x1)> for A2, while A2 learns f2 from x2 and labels lots of instances <[x1, x2], f2(x2)> for A1, each round producing new hypotheses f1 and f2 (a minimal looping sketch follows)]
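A minimal co-training loop in the spirit of the diagrams above: each view's classifier labels the unlabeled instances it is most confident about and hands them to the shared labeled pool. The threshold "learner" and the confidence rule are toy stand-ins for real algorithms A1, A2 (e.g., Naive Bayes).

```python
import random

def learn_threshold(pairs):
    """Toy learner for one view: threshold halfway between the class means."""
    pos = [v for v, y in pairs if y == 1]
    neg = [v for v, y in pairs if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def co_train(labeled, unlabeled, rounds=5, per_round=4):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        t1 = learn_threshold([(x1, y) for (x1, x2), y in labeled])   # A1 on view x1
        t2 = learn_threshold([(x2, y) for (x1, x2), y in labeled])   # A2 on view x2
        # f1 labels the unlabeled instances it is most confident about ...
        unlabeled.sort(key=lambda x: -abs(x[0] - t1))
        labeled += [(x, int(x[0] > t1)) for x in unlabeled[:per_round]]
        unlabeled = unlabeled[per_round:]
        # ... and f2 does the same, so each view teaches the other
        unlabeled.sort(key=lambda x: -abs(x[1] - t2))
        labeled += [(x, int(x[1] > t2)) for x in unlabeled[:per_round]]
        unlabeled = unlabeled[per_round:]
    return t1, t2

rng = random.Random(0)
make = lambda y: ((rng.gauss(3 if y else 0, 1), rng.gauss(3 if y else 0, 1)), y)
few_labeled = [make(y) for y in (0, 1, 0, 1)]
lots_unlabeled = [make(rng.randint(0, 1))[0] for _ in range(60)]
print(co_train(few_labeled, lots_unlabeled))   # the two learned hypotheses f1, f2
```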

Page 21: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 21

It really works!

• Learning to classify web pages as course pages
  - x1 = bag of words on the page
  - x2 = bag of words from all anchors pointing to the page
• Naive Bayes classifiers; 12 labeled pages, 1039 unlabeled

[Plot: percentage error]

Page 22: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Representing Uncertainty

Page 23: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 23

Many Techniques Developed

• Fuzzy logic
• Certainty factors
• Non-monotonic logic
• Probability

• Only one has stood the test of time!


Page 24: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 24

Aspects of Uncertainty

• Suppose you have a flight at 12 noon
• When should you leave for SEATAC?
  - What are the traffic conditions?
  - How crowded is security?
• Leaving 18 hours early may get you there, but ... ?

Page 25: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 25

Decision Theory = Probability + Utility Theory

Minutes before noon    P(arrive-in-time)
          20 min            0.05
          30 min            0.25
          45 min            0.50
          60 min            0.75
         120 min            0.98
        1080 min            0.99

Which option to pick depends on your preferences.
Utility theory: representing & reasoning about preferences (a toy calculation follows).
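To illustrate how probability and utility combine, here is a toy calculation using the table above. The probabilities come from the slide; the utilities (reward for catching the flight, penalty for missing it, cost per hour of waiting) are made-up numbers just to show the mechanics.

```python
# P(arrive-in-time) from the slide, keyed by minutes of slack before noon.
arrive_prob = {20: 0.05, 30: 0.25, 45: 0.50, 60: 0.75, 120: 0.98, 1080: 0.99}

# Hypothetical utilities: these are NOT from the slides.
U_CATCH, U_MISS, WAIT_COST_PER_HOUR = 100.0, -500.0, -20.0

def expected_utility(minutes_early):
    p = arrive_prob[minutes_early]
    waiting = WAIT_COST_PER_HOUR * minutes_early / 60.0   # more slack, more waiting
    return p * U_CATCH + (1 - p) * U_MISS + waiting

best = max(arrive_prob, key=expected_utility)
print(best, round(expected_utility(best), 1))   # 120 minutes wins under these utilities
```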

Page 26: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 26

What Is Probability?

• Probability: a calculus for dealing with nondeterminism and uncertainty (cf. logic)

• Probabilistic model: Says how often we expect different things to occur

Page 27: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 27

What Is Statistics?

• Statistics 1:   Describing data

• Statistics 2: Inferring probabilistic models from data
  - Structure
  - Parameters

Page 28: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 28

Why Should You Care?

• The world is full of uncertainty
  - Logic is not enough
  - Computers need to be able to handle uncertainty
• Probability: a new foundation for AI (& CS!)
• Massive amounts of data around today
  - Statistics and CS are both about data
  - Statistics lets us summarize and understand it
  - Statistics is the basis for most learning
• Statistics lets data do our work for us

Page 29: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 29

Outline

• Basic notions
  - Atomic events, probabilities, joint distribution
  - Inference by enumeration
  - Independence & conditional independence
  - Bayes' rule
• Bayesian networks
• Statistical learning
• Dynamic Bayesian networks (DBNs)
• Markov decision processes (MDPs)

Page 30: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 30

Prop. Logic vs. Probability

Propositional logic                          Probability
Symbol: Q, R, …                              Random variable: Q, …
Boolean values: T, F                         Domain: you specify, e.g. {heads, tails} or [1, 6]
State of the world:                          Atomic event: complete specification of a world,
  an assignment to Q, R, …, Z                  Q, …, Z; atomic events are mutually exclusive
                                               and exhaustive
                                             Prior (unconditional) probability: P(Q)
                                             Joint distribution: probability of every atomic event

Page 31: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Types of Random Variables

Page 32: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Axioms of Probability Theory

• Just 3 are enough to build the entire theory!
  1. All probabilities are between 0 and 1:  0 ≤ P(A) ≤ 1
  2. P(true) = 1 and P(false) = 0
  3. The probability of a disjunction of events is:
     P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

[Venn diagram: A ∨ B counts the overlap A ∧ B twice, hence the subtraction]

Page 33: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Prior and Joint Probability

We will see later how any question can be answered by the joint distribution


Page 34: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Conditional (or Posterior) Probability

• Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8
  i.e., given that Toothache is true (and that is all I know)
• Notation for conditional distributions:
  P(Cavity | Toothache) = a 2-element vector of 2-element vectors
  (2 P values when Toothache is true and 2 when it is false)
• If we know more, e.g., cavity is also given, then we have
  P(cavity | toothache, cavity) = 1
• New evidence may be irrelevant, allowing simplification:
  P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8

Page 35: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Conditional Probability

• P(A | B) is the probability of A given B
• Assumes that B is the only information known
• Defined as:
  P(A | B) = P(A ∧ B) / P(B)

[Venn diagram: the region A ∧ B inside B]

Page 36: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Dilemma at the Dentist’s

What is the probability of a cavity given a toothache?What is the probability of a cavity given the probe catches?

Page 37: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 37

Inference by Enumeration

P(toothache) = .108 + .012 + .016 + .064 = .20, or 20%

This process is called "marginalization".

Page 38: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 38

Inference by Enumeration

P(toothache ∨ cavity) = .20 + .072 + .008 = .28
(see the enumeration sketch below)
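A small enumeration sketch over the joint distribution used in these slides (Cavity, Toothache, Catch). The six entries that appear on the slides are shown; the two remaining entries (0.144 and 0.576) are taken from the standard dentist example, so treat them as an assumption.

```python
# Inference by enumeration over the full joint of (Cavity, Toothache, Catch).
joint = {
    # (cavity, toothache, catch): probability
    (True,  True,  True ): 0.108, (True,  True,  False): 0.012,
    (True,  False, True ): 0.072, (True,  False, False): 0.008,
    (False, True,  True ): 0.016, (False, True,  False): 0.064,
    (False, False, True ): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint over every atomic event where `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda cavity, toothache, catch: toothache))             # 0.20
print(prob(lambda cavity, toothache, catch: toothache or cavity))   # 0.28
```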

Page 39: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 39

Inference by Enumeration

Page 40: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Problems with Enumeration

• Worst-case time: O(d^n)
  - where d = the maximum arity of the random variables (e.g., d = 2 for Boolean T/F)
  - and n = the number of random variables
• Space complexity is also O(d^n)
  - the size of the joint distribution
• Problem: it is hard or impossible to estimate all O(d^n) entries for large problems

Page 41: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 41

Independence

• A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
  (these two constraints are logically equivalent)
• Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  so P(A ∧ B) = P(A) · P(B)

Page 42: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 42

Independence

[Venn diagram: the space of possible worlds, with overlapping events A and B]

Page 43: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 43

Independence

Complete independence is powerful but rare.
What to do if it doesn't hold?

Page 44: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 44

Conditional Independence

[Venn diagram: events A and B]

A and B are not independent, since P(A | B) < P(A)

Page 45: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 45

Conditional Independence

[Venn diagram: events A and B, plus an event C overlapping both (regions A ∧ C and B ∧ C shown)]

But A and B are made independent by C:
P(A | C) = P(A | B, C)

Page 46: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 46

Conditional Independence

Instead of 7 entries, only need 5

Page 47: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 47

Conditional Independence II

P(catch | toothache, cavity) = P(catch | cavity)
P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Why only 5 entries in the table?

Page 48: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 48

Power of Cond. Independence

• Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!!

• Conditional independence is the most basic & robust form of knowledge about uncertain environments.
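As a concrete count of that exponential-to-linear saving, consider (as an illustrative case, not one from the slides) a single Boolean cause C with n Boolean effects E1, ..., En that are conditionally independent given C:

```latex
\underbrace{2^{\,n+1}-1}_{\text{entries in the full joint over } C, E_1, \dots, E_n}
\qquad\text{vs.}\qquad
\underbrace{1 + 2n}_{P(C)\ \text{plus}\ P(E_i \mid C)\ \text{and}\ P(E_i \mid \lnot C)\ \text{for each } i}
```

For n = 30, that is roughly two billion numbers versus 61.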

Page 49: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Next Up…

• Bayes’ Rule• Bayesian Inference• Bayesian Networks

Bayes rules!

Page 50: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 50

Bayes Rule

P(H | E) = P(E | H) · P(H) / P(E)

Simple proof from the definition of conditional probability:

1. P(H | E) = P(H ∧ E) / P(E)            (def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H)            (def. cond. prob.)
3. P(H ∧ E) = P(E | H) · P(H)            (multiply #2 by P(H))
4. P(H | E) = P(E | H) · P(H) / P(E)     (substitute #3 into #1)    QED

Page 51: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 51

Use to Compute Diagnostic Probability from Causal Probability

E.g., let M be meningitis and S be stiff neck.
P(M) = 0.0001, P(S) = 0.1, P(S | M) = 0.8

P(M | S) = ?  (worked below)
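Working the slide's numbers through Bayes' rule (this fills in the blank the slide leaves open):

```latex
P(M \mid S) \;=\; \frac{P(S \mid M)\,P(M)}{P(S)}
            \;=\; \frac{0.8 \times 0.0001}{0.1}
            \;=\; 0.0008
```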

Page 52: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 52

Bayes’ Rule & Cond. Independence

Page 53: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 53

Bayes Nets

• In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation & inference
• BNs provide a graphical representation of the conditional independence relations in P
  - usually quite compact
  - requires assessment of fewer parameters, and those parameters are quite natural (e.g., causal)
  - (usually) efficient inference: query answering and belief update

Page 54: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 55

An Example Bayes Net

[Network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is also the parent of Radio]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A | E, B) (value in parentheses is Pr(A=f)):
   e,  b :  0.9   (0.1)
   e, ¬b :  0.2   (0.8)
  ¬e,  b :  0.85  (0.15)
  ¬e, ¬b :  0.01  (0.99)

Page 55: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 56

Earthquake Example (con’t)

• If I know Alarm, no other evidence influences my degree of belief in Nbr1Calls
  - P(N1 | N2, A, E, B) = P(N1 | A)
  - also: P(N2 | N1, A, E, B) = P(N2 | A) and P(E | B) = P(E)
• By the chain rule we have
  P(N1, N2, A, E, B) = P(N1 | N2, A, E, B) · P(N2 | A, E, B) · P(A | E, B) · P(E | B) · P(B)
                     = P(N1 | A) · P(N2 | A) · P(A | B, E) · P(E) · P(B)
• The full joint requires only 10 parameters (cf. 32); see the sketch below

[Network diagram as before: Earthquake, Burglary -> Alarm -> Nbr1Calls, Nbr2Calls; Earthquake -> Radio]
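A small sketch of the factored joint from this slide. P(B) and P(A | E, B) are taken from the earlier CPT slide; P(E) and the neighbor CPTs P(N1 | A), P(N2 | A) are not shown on the slides, so the values below are hypothetical placeholders.

```python
from itertools import product

P_B = 0.05                                        # from the CPT slide
P_E = 0.03                                        # hypothetical
P_A = {(True, True): 0.90, (True, False): 0.20,   # P(A=t | E, B), from the CPT slide
       (False, True): 0.85, (False, False): 0.01}
P_N_GIVEN_A = {True: 0.80, False: 0.10}           # hypothetical; reused for N1 and N2

def bern(p, value):
    """P(X = value) for a Boolean variable with P(X = true) = p."""
    return p if value else 1.0 - p

def joint(n1, n2, a, e, b):
    # P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|E,B) P(E) P(B)
    return (bern(P_N_GIVEN_A[a], n1) * bern(P_N_GIVEN_A[a], n2) *
            bern(P_A[(e, b)], a) * bern(P_E, e) * bern(P_B, b))

# 10 CPT parameters (1 + 1 + 4 + 2 + 2) determine all 32 joint entries:
print(sum(joint(*world) for world in product([True, False], repeat=5)))   # 1.0
```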

Page 56: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 57

BNs: Qualitative Structure

• The graphical structure of a BN reflects the conditional independence relations among its variables
• Each variable X is a node in the DAG
• Edges denote direct probabilistic influence
  - usually interpreted causally
  - the parents of X are denoted Par(X)
• X is conditionally independent of all non-descendants given its parents
  - A graphical test exists for more general independence
  - "Markov blanket"

Page 57: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 58

Given Parents, X is Independent of Non-Descendants

Page 58: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 59

For Example

[Network diagram as before: Earthquake, Burglary -> Alarm -> Nbr1Calls, Nbr2Calls; Earthquake -> Radio]

Page 59: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 60

Given Markov Blanket, X is Independent of All Other Nodes

MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))

Page 60: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 61

Conditional Probability Tables

[Same network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is also the parent of Radio]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A | E, B) (value in parentheses is Pr(A=f)):
   e,  b :  0.9   (0.1)
   e, ¬b :  0.2   (0.8)
  ¬e,  b :  0.85  (0.15)
  ¬e, ¬b :  0.01  (0.99)

Page 61: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 62

Conditional Probability Tables

• For a complete specification of the joint distribution, quantify the BN
• For each variable X, specify a CPT: P(X | Par(X))
  - the number of parameters is locally exponential in |Par(X)|
• If X1, X2, ..., Xn is any topological sort of the network, then we are assured:
  P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) · ... · P(X2 | X1) · P(X1)
                       = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) · ... · P(X1)