Machine Learning II: Ensembles & Cotraining / Representing Uncertainty (CSE 573)

Page 1: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Machine Learning II: Ensembles & Cotraining

CSE 573

Representing Uncertainty

Page 2: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 2

Logistics

• Reading: Ch. 13; Ch. 14 through 14.3

Page 3: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 3

DT Learning as Search

• Nodes: decision trees
• Operators: tree refinement (sprouting the tree)
• Initial node: the smallest tree possible (a single leaf)
• Heuristic: information gain
• Goal: the best tree possible (???)
• Type of search: hill climbing

Page 4: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Bias

• Restriction

• Preference

Page 5: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 5

Decision Tree Representation

[Decision tree figure: root Outlook with branches Sunny, Overcast, Rain; the Sunny branch tests Humidity (High -> No, Normal -> Yes); Overcast -> Yes; the Rain branch tests Wind (Strong -> No, Weak -> Yes)]

Good day for tennis?
Leaves = classification
Arcs = choice of value for the parent attribute

A decision tree is equivalent to logic in disjunctive normal form:
G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
(a small code sketch of this tree follows)
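To make the representation concrete, here is a minimal sketch of the tree above as nested Python dicts with a small classification helper. The attribute and value names follow the slide; the dict encoding and the classify function are illustrative choices, not course code.

```python
# Decision tree as nested dicts: an internal node maps one attribute to a
# dict of {value: subtree}; a leaf is just the classification string.
tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow the arcs chosen by the example's attribute values until a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

print(classify(tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes
```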

Page 6: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 6

Overfitting

[Plot: accuracy (y-axis, roughly 0.6 to 0.9) vs. model complexity (x-axis, e.g. number of nodes in the decision tree); accuracy on the training data keeps climbing while accuracy on the test data peaks and then falls off]

Page 7: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 7

Machine Learning Outline

•Supervised Learning Review

•Ensembles of Classifiers: Bagging, Cross-validated committees, Boosting, Stacking

•Co-training

Page 8: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 8

Voting

Page 9: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 9

Ensembles of Classifiers

• Assume errors are independent (suppose each classifier has 30% error) and that we take a majority vote
• Probability that the majority is wrong = area under the binomial distribution for more than half the classifiers being in error
• For an ensemble of 21 classifiers with individual error 0.3, the area under the curve for 11 or more being wrong is 0.026
• An order of magnitude improvement! (see the computation sketch below)

[Plot: probability (y-axis, ticks at 0.1, 0.2) vs. number of classifiers in error, for an ensemble of 21 classifiers]
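A quick check of the 0.026 figure above; a minimal sketch assuming 21 classifiers that err independently with probability 0.3 each:

```python
from math import comb

def majority_error(n: int, p: float) -> float:
    """P(more than half of n independent classifiers are wrong), each with error p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(round(majority_error(21, 0.3), 3))   # 0.026 -- the area under the binomial tail
```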

Page 10: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 10

Constructing Ensembles: Cross-Validated Committees

• Partition the examples into k disjoint equivalence classes
• Now create k training sets (see the sketch below)
  - Each set is the union of all equivalence classes except one
  - So each set has (k-1)/k of the original training data
• Now train a classifier on each set

[Figure: k copies of the data, each with a different partition held out]
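A minimal sketch of the construction just described, assuming a plain Python list of examples (the fold-splitting scheme is one simple choice):

```python
def cv_committee_sets(examples, k):
    """k training sets, each the union of all folds except one (holdout)."""
    folds = [examples[i::k] for i in range(k)]          # k disjoint partitions
    return [
        [x for j, fold in enumerate(folds) if j != i for x in fold]
        for i in range(k)
    ]

data = list(range(10))
for train_set in cv_committee_sets(data, k=5):
    print(train_set)   # each set holds (k-1)/k = 80% of the original data
# One classifier would then be trained on each of the k sets.
```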

Page 11: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 11

Ensemble Construction II: Bagging

• Generate k sets of training examples
• For each set, draw m examples randomly (with replacement) from the original set of m examples
• Each training set therefore contains about 63.2% of the original examples (plus duplicates); see the sketch below
• Now train a classifier on each set
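A minimal bagging sketch under the same assumptions (plain Python, illustrative names). It also shows where the 63.2% figure comes from: the chance that a given example appears in a bootstrap sample of size m is 1 - (1 - 1/m)^m ≈ 1 - 1/e ≈ 0.632.

```python
import random

def bootstrap_sets(examples, k, seed=0):
    """k bootstrap samples: m draws with replacement from the m originals."""
    rng = random.Random(seed)
    m = len(examples)
    return [[rng.choice(examples) for _ in range(m)] for _ in range(k)]

data = list(range(1000))
samples = bootstrap_sets(data, k=5)
print([len(set(s)) / len(data) for s in samples])   # each ~0.632 of the originals
# One classifier would then be trained on each bootstrap sample.
```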

Page 12: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 12

Ensemble Creation III: Boosting

• Maintain a probability distribution over the set of training examples
• Create k sets of training data iteratively; on iteration i:
  - Draw m examples randomly (like bagging), but use the probability distribution to bias the selection
  - Train classifier number i on this training set
  - Test the partial ensemble (of i classifiers) on all training examples
  - Modify the distribution: increase P of each misclassified example
• This creates harder and harder learning problems...
• "Bagging with optimized choice of examples" (see the sketch below)
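A simplified sketch of the loop described above (the resampling flavor of boosting, not the full AdaBoost weight and vote formulas); the threshold-stump weak learner and the factor of 2 used to up-weight errors are illustrative choices only:

```python
import random

def train_stump(sample):
    """Toy weak learner: the threshold that best separates the sample."""
    best = None
    for t in {x for x, _ in sample}:
        err = sum((x > t) != (y == "pos") for x, y in sample)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def vote(ensemble, x):
    votes = ["pos" if x > t else "neg" for t in ensemble]
    return max(set(votes), key=votes.count)

def boost(examples, k, seed=0):
    rng = random.Random(seed)
    weights = [1.0] * len(examples)            # distribution over training examples
    ensemble = []
    for _ in range(k):
        # draw m examples, biased by the current distribution (like bagging, but biased)
        sample = rng.choices(examples, weights=weights, k=len(examples))
        ensemble.append(train_stump(sample))
        # test the partial ensemble and give misclassified examples more mass
        for i, (x, y) in enumerate(examples):
            if vote(ensemble, x) != y:
                weights[i] *= 2.0
    return ensemble

data = [(x, "pos" if x >= 6 else "neg") for x in range(12)]
print(boost(data, k=5))                        # list of learned thresholds
```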

Page 13: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 13

Ensemble Creation IV: Stacking

• Train several base learners
• Next train a meta-learner
  - It learns when the base learners are right / wrong
  - The meta-learner then arbitrates
• Train using cross-validated committees
  - Meta-learner inputs = base-learner predictions
  - Meta-learner training examples = the 'test set' from cross-validation
(see the sketch below)
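A minimal stacking sketch, assuming scikit-learn is available (the slides do not prescribe a library): two base learners feed their cross-validated predictions to a logistic-regression meta-learner, which learns when to trust each of them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,   # the meta-learner trains on out-of-fold base predictions
)
print(stack.fit(X, y).score(X, y))
```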

Page 14: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 14

Machine Learning Outline

•Supervised Learning Review

•Ensembles of Classifiers: Bagging, Cross-validated committees, Boosting, Stacking

•Co-training

Page 15: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 15

Co-Training Motivation

• Learning methods need labeled data
  - Lots of <x, f(x)> pairs
  - Hard to get... (who wants to label data?)
• But unlabeled data is usually plentiful...
  - Could we use it instead?

Page 16: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 16

Co-training

Suppose:

• We have a little labeled data plus lots of unlabeled data
• Each instance has two parts: x = [x1, x2], where x1 and x2 are conditionally independent given f(x)
• Each half can be used to classify the instance: there exist f1, f2 such that f1(x1) ~ f2(x2) ~ f(x)
• Both f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, with learning algorithms A1, A2

Page 17: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 17

Without Co-training: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: from a few labeled instances <[x1, x2], f()>, A1 learns f1 from the x1 half and A2 learns f2 from the x2 half; combining f1 and f2 into f' with an ensemble is bad, because the unlabeled instances are never used!]

Page 18: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 18

Co-training: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: A1 learns f1 from the x1 half of a few labeled instances <[x1, x2], f()>; f1 then labels the unlabeled instances, yielding lots of labeled instances <[x1, x2], f1(x1)>; A2 trains on the x2 half of these and outputs hypothesis f2]

Page 19: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 19

Observations

• We can apply A1 to generate as much training data as we want
  - If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!
• Thus there is no limit to the quality of the hypothesis A2 can make

Page 20: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 20

Co-training, iterated: f1(x1) ~ f2(x2) ~ f(x)

[Diagram: the previous loop now runs in both directions; A1 learns f1 from x1 and labels lots of instances <[x1, x2], f1(x1)> for A2, while A2 learns f2 from x2 and labels lots of instances <[x1, x2], f2(x2)> for A1, each round producing new hypotheses f1 and f2 (a minimal looping sketch follows)]
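A minimal co-training loop in the spirit of the diagrams above: each view's classifier labels the unlabeled instances it is most confident about and hands them to the shared labeled pool. The threshold "learner" and the confidence rule are toy stand-ins for real algorithms A1, A2 (e.g., Naive Bayes).

```python
import random

def learn_threshold(pairs):
    """Toy learner for one view: threshold halfway between the class means."""
    pos = [v for v, y in pairs if y == 1]
    neg = [v for v, y in pairs if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def co_train(labeled, unlabeled, rounds=5, per_round=4):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        t1 = learn_threshold([(x1, y) for (x1, x2), y in labeled])   # A1 on view x1
        t2 = learn_threshold([(x2, y) for (x1, x2), y in labeled])   # A2 on view x2
        # f1 labels the unlabeled instances it is most confident about ...
        unlabeled.sort(key=lambda x: -abs(x[0] - t1))
        labeled += [(x, int(x[0] > t1)) for x in unlabeled[:per_round]]
        unlabeled = unlabeled[per_round:]
        # ... and f2 does the same, so each view teaches the other
        unlabeled.sort(key=lambda x: -abs(x[1] - t2))
        labeled += [(x, int(x[1] > t2)) for x in unlabeled[:per_round]]
        unlabeled = unlabeled[per_round:]
    return t1, t2

rng = random.Random(0)
make = lambda y: ((rng.gauss(3 if y else 0, 1), rng.gauss(3 if y else 0, 1)), y)
few_labeled = [make(y) for y in (0, 1, 0, 1)]
lots_unlabeled = [make(rng.randint(0, 1))[0] for _ in range(60)]
print(co_train(few_labeled, lots_unlabeled))   # the two learned hypotheses f1, f2
```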

Page 21: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 21

It really works!

• Learning to classify web pages as course pages
  - x1 = bag of words on the page
  - x2 = bag of words from all anchors pointing to the page
• Naive Bayes classifiers; 12 labeled pages, 1039 unlabeled

[Plot: percentage error]

Page 22: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Representing Uncertainty

Page 23: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 23

Many Techniques Developed

• Fuzzy logic
• Certainty factors
• Non-monotonic logic
• Probability

• Only one has stood the test of time!


Page 24: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 24

Aspects of Uncertainty

• Suppose you have a flight at 12 noon
• When should you leave for SEATAC?
  - What are the traffic conditions?
  - How crowded is security?
• Leaving 18 hours early may get you there, but ... ?

Page 25: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 25

Decision Theory = Probability + Utility Theory

Minutes before noon    P(arrive-in-time)
          20 min            0.05
          30 min            0.25
          45 min            0.50
          60 min            0.75
         120 min            0.98
        1080 min            0.99

Which option to pick depends on your preferences.
Utility theory: representing & reasoning about preferences (a toy calculation follows).
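To illustrate how probability and utility combine, here is a toy calculation using the table above. The probabilities come from the slide; the utilities (reward for catching the flight, penalty for missing it, cost per hour of waiting) are made-up numbers just to show the mechanics.

```python
# P(arrive-in-time) from the slide, keyed by minutes of slack before noon.
arrive_prob = {20: 0.05, 30: 0.25, 45: 0.50, 60: 0.75, 120: 0.98, 1080: 0.99}

# Hypothetical utilities: these are NOT from the slides.
U_CATCH, U_MISS, WAIT_COST_PER_HOUR = 100.0, -500.0, -20.0

def expected_utility(minutes_early):
    p = arrive_prob[minutes_early]
    waiting = WAIT_COST_PER_HOUR * minutes_early / 60.0   # more slack, more waiting
    return p * U_CATCH + (1 - p) * U_MISS + waiting

best = max(arrive_prob, key=expected_utility)
print(best, round(expected_utility(best), 1))   # 120 minutes wins under these utilities
```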

Page 26: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 26

What Is Probability?

• Probability: a calculus for dealing with nondeterminism and uncertainty (cf. logic)

• Probabilistic model: Says how often we expect different things to occur

Page 27: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 27

What Is Statistics?

• Statistics 1:   Describing data

• Statistics 2: Inferring probabilistic models from data
  - Structure
  - Parameters

Page 28: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 28

Why Should You Care?

• The world is full of uncertainty
  - Logic is not enough
  - Computers need to be able to handle uncertainty
• Probability: a new foundation for AI (& CS!)
• Massive amounts of data around today
  - Statistics and CS are both about data
  - Statistics lets us summarize and understand it
  - Statistics is the basis for most learning
• Statistics lets data do our work for us

Page 29: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 29

Outline

• Basic notions
  - Atomic events, probabilities, joint distribution
  - Inference by enumeration
  - Independence & conditional independence
  - Bayes' rule
• Bayesian networks
• Statistical learning
• Dynamic Bayesian networks (DBNs)
• Markov decision processes (MDPs)

Page 30: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 30

Prop. Logic vs. Probability

Propositional logic                          Probability
Symbol: Q, R, …                              Random variable: Q, …
Boolean values: T, F                         Domain: you specify, e.g. {heads, tails} or [1, 6]
State of the world:                          Atomic event: complete specification of a world,
  an assignment to Q, R, …, Z                  Q, …, Z; atomic events are mutually exclusive
                                               and exhaustive
                                             Prior (unconditional) probability: P(Q)
                                             Joint distribution: probability of every atomic event

Page 31: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Types of Random Variables

Page 32: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Axioms of Probability Theory

• Just 3 are enough to build the entire theory!
  1. All probabilities are between 0 and 1:  0 ≤ P(A) ≤ 1
  2. P(true) = 1 and P(false) = 0
  3. The probability of a disjunction of events is:
     P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

[Venn diagram: A ∨ B counts the overlap A ∧ B twice, hence the subtraction]

Page 33: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Prior and Joint Probability

We will see later how any question can be answered by the joint distribution


Page 34: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Conditional (or Posterior) Probability

• Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8
  i.e., given that Toothache is true (and that is all I know)
• Notation for conditional distributions:
  P(Cavity | Toothache) = a 2-element vector of 2-element vectors
  (2 P values when Toothache is true and 2 when it is false)
• If we know more, e.g., cavity is also given, then we have
  P(cavity | toothache, cavity) = 1
• New evidence may be irrelevant, allowing simplification:
  P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8

Page 35: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Conditional Probability

• P(A | B) is the probability of A given B
• Assumes that B is the only information known
• Defined as:
  P(A | B) = P(A ∧ B) / P(B)

[Venn diagram: the region A ∧ B inside B]

Page 36: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Dilemma at the Dentist’s

What is the probability of a cavity given a toothache?What is the probability of a cavity given the probe catches?

Page 37: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 37

Inference by Enumeration

P(toothache) = .108 + .012 + .016 + .064 = .20, or 20%

This process is called "marginalization".

Page 38: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 38

Inference by Enumeration

P(toothache ∨ cavity) = .20 + .072 + .008 = .28
(see the enumeration sketch below)
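A small enumeration sketch over the joint distribution used in these slides (Cavity, Toothache, Catch). The six entries that appear on the slides are shown; the two remaining entries (0.144 and 0.576) are taken from the standard dentist example, so treat them as an assumption.

```python
# Inference by enumeration over the full joint of (Cavity, Toothache, Catch).
joint = {
    # (cavity, toothache, catch): probability
    (True,  True,  True ): 0.108, (True,  True,  False): 0.012,
    (True,  False, True ): 0.072, (True,  False, False): 0.008,
    (False, True,  True ): 0.016, (False, True,  False): 0.064,
    (False, False, True ): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint over every atomic event where `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda cavity, toothache, catch: toothache))             # 0.20
print(prob(lambda cavity, toothache, catch: toothache or cavity))   # 0.28
```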

Page 39: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 39

Inference by Enumeration

Page 40: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Problems with Enumeration

• Worst-case time: O(d^n)
  - where d = the maximum arity of the random variables (e.g., d = 2 for Boolean T/F)
  - and n = the number of random variables
• Space complexity is also O(d^n)
  - the size of the joint distribution
• Problem: it is hard or impossible to estimate all O(d^n) entries for large problems

Page 41: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 41

Independence

• A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
  (these two constraints are logically equivalent)
• Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  so P(A ∧ B) = P(A) · P(B)

Page 42: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 42

Independence

[Venn diagram: the space of possible worlds, with overlapping events A and B]

Page 43: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 43

Independence

Complete independence is powerful but rare.
What to do if it doesn't hold?

Page 44: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 44

Conditional Independence

[Venn diagram: events A and B]

A and B are not independent, since P(A | B) < P(A)

Page 45: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 45

Conditional Independence

[Venn diagram: events A and B, plus an event C overlapping both (regions A ∧ C and B ∧ C shown)]

But A and B are made independent by C:
P(A | C) = P(A | B, C)

Page 46: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 46

Conditional Independence

Instead of 7 entries, only need 5

Page 47: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 47

Conditional Independence II

P(catch | toothache, cavity) = P(catch | cavity)
P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Why only 5 entries in the table?

Page 48: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 48

Power of Cond. Independence

• Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!!

• Conditional independence is the most basic & robust form of knowledge about uncertain environments.
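As a concrete count of that exponential-to-linear saving, consider (as an illustrative case, not one from the slides) a single Boolean cause C with n Boolean effects E1, ..., En that are conditionally independent given C:

```latex
\underbrace{2^{\,n+1}-1}_{\text{entries in the full joint over } C, E_1, \dots, E_n}
\qquad\text{vs.}\qquad
\underbrace{1 + 2n}_{P(C)\ \text{plus}\ P(E_i \mid C)\ \text{and}\ P(E_i \mid \lnot C)\ \text{for each } i}
```

For n = 30, that is roughly two billion numbers versus 61.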

Page 49: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

Next Up…

• Bayes’ Rule• Bayesian Inference• Bayesian Networks

Bayes rules!

Page 50: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 50

Bayes Rule

P(H | E) = P(E | H) · P(H) / P(E)

Simple proof from the definition of conditional probability:

1. P(H | E) = P(H ∧ E) / P(E)            (def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H)            (def. cond. prob.)
3. P(H ∧ E) = P(E | H) · P(H)            (multiply #2 by P(H))
4. P(H | E) = P(E | H) · P(H) / P(E)     (substitute #3 into #1)    QED

Page 51: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 51

Use to Compute Diagnostic Probability from Causal Probability

E.g., let M be meningitis and S be stiff neck.
P(M) = 0.0001, P(S) = 0.1, P(S | M) = 0.8

P(M | S) = ?  (worked below)
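Working the slide's numbers through Bayes' rule (this fills in the blank the slide leaves open):

```latex
P(M \mid S) \;=\; \frac{P(S \mid M)\,P(M)}{P(S)}
            \;=\; \frac{0.8 \times 0.0001}{0.1}
            \;=\; 0.0008
```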

Page 52: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 52

Bayes’ Rule & Cond. Independence

Page 53: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 53

Bayes Nets

• In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation & inference
• BNs provide a graphical representation of the conditional independence relations in P
  - usually quite compact
  - requires assessment of fewer parameters, and those parameters are quite natural (e.g., causal)
  - (usually) efficient inference: query answering and belief update

Page 54: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 55

An Example Bayes Net

[Network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is also the parent of Radio]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A | E, B) (value in parentheses is Pr(A=f)):
   e,  b :  0.9   (0.1)
   e, ¬b :  0.2   (0.8)
  ¬e,  b :  0.85  (0.15)
  ¬e, ¬b :  0.01  (0.99)

Page 55: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 56

Earthquake Example (con’t)

• If I know Alarm, no other evidence influences my degree of belief in Nbr1Calls
  - P(N1 | N2, A, E, B) = P(N1 | A)
  - also: P(N2 | N1, A, E, B) = P(N2 | A) and P(E | B) = P(E)
• By the chain rule we have
  P(N1, N2, A, E, B) = P(N1 | N2, A, E, B) · P(N2 | A, E, B) · P(A | E, B) · P(E | B) · P(B)
                     = P(N1 | A) · P(N2 | A) · P(A | B, E) · P(E) · P(B)
• The full joint requires only 10 parameters (cf. 32); see the sketch below

[Network diagram as before: Earthquake, Burglary -> Alarm -> Nbr1Calls, Nbr2Calls; Earthquake -> Radio]
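A small sketch of the factored joint from this slide. P(B) and P(A | E, B) are taken from the earlier CPT slide; P(E) and the neighbor CPTs P(N1 | A), P(N2 | A) are not shown on the slides, so the values below are hypothetical placeholders.

```python
from itertools import product

P_B = 0.05                                        # from the CPT slide
P_E = 0.03                                        # hypothetical
P_A = {(True, True): 0.90, (True, False): 0.20,   # P(A=t | E, B), from the CPT slide
       (False, True): 0.85, (False, False): 0.01}
P_N_GIVEN_A = {True: 0.80, False: 0.10}           # hypothetical; reused for N1 and N2

def bern(p, value):
    """P(X = value) for a Boolean variable with P(X = true) = p."""
    return p if value else 1.0 - p

def joint(n1, n2, a, e, b):
    # P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|E,B) P(E) P(B)
    return (bern(P_N_GIVEN_A[a], n1) * bern(P_N_GIVEN_A[a], n2) *
            bern(P_A[(e, b)], a) * bern(P_E, e) * bern(P_B, b))

# 10 CPT parameters (1 + 1 + 4 + 2 + 2) determine all 32 joint entries:
print(sum(joint(*world) for world in product([True, False], repeat=5)))   # 1.0
```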

Page 56: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 57

BNs: Qualitative Structure

• The graphical structure of a BN reflects the conditional independence relations among its variables
• Each variable X is a node in the DAG
• Edges denote direct probabilistic influence
  - usually interpreted causally
  - the parents of X are denoted Par(X)
• X is conditionally independent of all non-descendants given its parents
  - A graphical test exists for more general independence
  - "Markov blanket"

Page 57: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 58

Given Parents, X is Independent of Non-Descendants

Page 58: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 59

For Example

[Network diagram as before: Earthquake, Burglary -> Alarm -> Nbr1Calls, Nbr2Calls; Earthquake -> Radio]

Page 59: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 60

Given Markov Blanket, X is Independent of All Other Nodes

MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))

Page 60: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 61

Conditional Probability Tables

[Same network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is also the parent of Radio]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A | E, B) (value in parentheses is Pr(A=f)):
   e,  b :  0.9   (0.1)
   e, ¬b :  0.2   (0.8)
  ¬e,  b :  0.85  (0.15)
  ¬e, ¬b :  0.01  (0.99)

Page 61: Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

© Daniel S. Weld 62

Conditional Probability Tables

• For a complete specification of the joint distribution, quantify the BN
• For each variable X, specify a CPT: P(X | Par(X))
  - the number of parameters is locally exponential in |Par(X)|
• If X1, X2, ..., Xn is any topological sort of the network, then we are assured:
  P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) · ... · P(X2 | X1) · P(X1)
                       = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) · ... · P(X1)