
  • Data Mining Classification: Alternative Techniques

    Lecture Notes for Chapter 5

    Introduction to Data Mining by Tan, Steinbach, Kumar

  • Rule-Based Classifier: Classify records by using a collection of if…then rules

    Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label. LHS: rule antecedent or condition. RHS: rule consequent. Examples of classification rules: (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds; (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

  • Rule-based Classifier (Example)

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

  • Application of Rule-Based Classifier: A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    The rule R1 covers a hawk => Bird
    The rule R3 covers the grizzly bear => Mammal

  • Rule Coverage and Accuracy: Coverage of a rule: the fraction of records that satisfy the antecedent of the rule. Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent

    (Status=Single) → No; Coverage = 40%, Accuracy = 50%. From the table below, 4 of the 10 records are Single, and 2 of those 4 have class No; see also the Python sketch after the table

    Tid | Refund | Marital Status | Taxable Income | Class
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes
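
    To make the coverage and accuracy numbers concrete, here is a minimal Python sketch that evaluates the rule (Status=Single) → No against this table (the record list and function name are illustrative, not from the slides):

    # Evaluate rule coverage and accuracy on the table above.
    records = [
        ("Yes", "Single", "125K", "No"),   ("No", "Married", "100K", "No"),
        ("No", "Single", "70K", "No"),     ("Yes", "Married", "120K", "No"),
        ("No", "Divorced", "95K", "Yes"),  ("No", "Married", "60K", "No"),
        ("Yes", "Divorced", "220K", "No"), ("No", "Single", "85K", "Yes"),
        ("No", "Married", "75K", "No"),    ("No", "Single", "90K", "Yes"),
    ]

    def rule_stats(records, antecedent, consequent):
        covered = [r for r in records if antecedent(r)]        # records satisfying the antecedent
        correct = [r for r in covered if r[-1] == consequent]  # covered records with the predicted class
        return len(covered) / len(records), len(correct) / len(covered)

    # Rule: (Status=Single) -> No
    print(rule_stats(records, lambda r: r[1] == "Single", "No"))  # (0.4, 0.5)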

  • How does Rule-based Classifier Work?

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    A lemur triggers rule R3, so it is classified as a mammal. A turtle triggers both R4 and R5. A dogfish shark triggers none of the rules.

  • Characteristics of Rule-Based Classifier: Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other; every record is covered by at most one rule

    Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values; each record is covered by at least one rule

  • From Decision Trees To Rules: The rules are mutually exclusive and exhaustive, and the rule set contains as much information as the tree

  • Rules Can Be Simplified: Initial rule: (Refund=No) ∧ (Status=Married) → No. Simplified rule: (Status=Married) → No

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Effect of Rule Simplification: Rules are no longer mutuallyexclusive; a record may trigger more than one rule. Solution? Use an ordered rule set, or an unordered rule set with voting schemes

    Rules are no longer exhaustive; a record may not trigger any rule. Solution? Use a default class

  • Ordered Rule Set: Rules are rank ordered according to their priority; an ordered rule set is known as a decision list. When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it has triggered; if none of the rules fire, it is assigned the default class

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

  • Rule Ordering Schemes: Rule-based ordering: individual rules are ranked based on their quality. Class-based ordering: rules that belong to the same class appear together

    Rule-based Ordering

    (Refund=Yes) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

    (Refund=No, Marital Status={Married}) ==> No

    Class-based Ordering

    (Refund=Yes) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No

    (Refund=No, Marital Status={Married}) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

  • Building Classification Rules: Direct Method: extract rules directly from data, e.g., RIPPER, CN2, Holte's 1R

    Indirect Method: extract rules from other classification models (e.g., decision trees, neural networks, etc.), e.g., C4.5rules

  • Direct Method: Sequential Covering: (1) Start from an empty rule; (2) grow a rule using the Learn-One-Rule function; (3) remove the training records covered by the rule; (4) repeat steps (2) and (3) until the stopping criterion is met
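
    As a rough illustration of this loop (a sketch, not the book's exact algorithm: learn_one_rule here is a deliberately simple greedy stand-in that picks the single best attribute test for the target class):

    # Sketch of sequential covering; records are (attribute-dict, class-label) pairs.
    def learn_one_rule(records, target_class):
        best, best_acc = None, -1.0
        for a in records[0][0]:                       # each attribute
            for v in {r[0][a] for r in records}:      # each observed value
                covered = [r for r in records if r[0][a] == v]
                acc = sum(r[1] == target_class for r in covered) / len(covered)
                if acc > best_acc:
                    best, best_acc = (a, v), acc
        return best

    def sequential_covering(records, target_class):
        rules, remaining = [], list(records)
        while any(r[1] == target_class for r in remaining):
            a, v = learn_one_rule(remaining, target_class)      # step (2): grow a rule
            rules.append(((a, v), target_class))
            remaining = [r for r in remaining if r[0][a] != v]  # step (3): remove covered records
        return rules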

  • Example of Sequential Covering

    (Figures: successive steps of sequential covering on a 2-D data set — find the best rule R1, remove the training records it covers, then find the next rule R2.)
  • Aspects of Sequential CoveringRule Growing

    Instance Elimination

    Rule Evaluation

    Stopping Criterion

    Rule Pruning

  • Rule Growing: Two common strategies

    (a) General-to-specific: start from the empty rule { } and repeatedly add the conjunct that most improves rule quality (candidates in the figure include Refund=No, Status=Single, Status=Divorced, Status=Married, Income>80K, …).

    (b) Specific-to-general: start from rules built around specific positive instances, e.g., Refund=No, Status=Single, Income=85K (Class=Yes) and Refund=No, Status=Single, Income=90K (Class=Yes), and generalize them by dropping conjuncts, here to Refund=No, Status=Single (Class=Yes).

  • Rule Growing (Examples): CN2 Algorithm: start from an empty conjunct {}; add the conjuncts that minimize the entropy measure: {A}, {A,B}, …; determine the rule consequent by taking the majority class of the instances covered by the rule. RIPPER Algorithm: start from an empty rule: {} => class; add the conjuncts that maximize FOIL's information gain measure: R0: {} => class (initial rule); R1: {A} => class (rule after adding a conjunct); Gain(R0, R1) = t × [ log2( p1/(p1+n1) ) − log2( p0/(p0+n0) ) ], where t is the number of positive instances covered by both R0 and R1,

    p0: number of positive instances covered by R0; n0: number of negative instances covered by R0; p1: number of positive instances covered by R1; n1: number of negative instances covered by R1
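
    That gain formula transcribes directly into Python (a small helper sketch; the argument names follow the definitions above, and for a refinement R1 of R0 every instance covered by R1 is also covered by R0, so t = p1):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        # p0, n0: positives/negatives covered by R0; p1, n1: by R1
        t = p1
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # Example: R0 covers 10 positives and 10 negatives; after adding a
    # conjunct, R1 covers 8 positives and 2 negatives.
    print(foil_gain(10, 10, 8, 2))  # ≈ 5.42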

  • Instance Elimination: Why do we need to eliminate instances? Otherwise, the next rule is identical to the previous rule. Why do we remove positive instances? To ensure that the next rule is different. Why do we remove negative instances? To prevent underestimating the accuracy of the rule. Compare rules R2 and R3 in the diagram.

    (Figure: a scatter of + and − training instances with the regions covered by rules R1, R2, and R3.)

  • Rule Evaluation: Metrics: Accuracy = nc / n

    Laplace = (nc + 1) / (n + k)

    M-estimate = (nc + k·p) / (n + k)

    n: number of instances covered by the rule; nc: number of covered instances that belong to the class predicted by the rule; k: number of classes; p: prior probability of that class
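
    The three metrics in Python, using the symbols just defined:

    def accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        # smooths accuracy toward 1/k for rules covering few instances
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        # smooths accuracy toward the prior p of the rule's class
        return (nc + k * p) / (n + k)

    # A rule covering n=5 instances, nc=4 of its class, k=2 classes, prior p=0.5:
    print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))
    # 0.8  ~0.714  ~0.714  (the last two coincide here because k*p = 1)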

  • Stopping Criterion and Rule Pruning: Stopping criterion: compute the gain; if the gain is not significant, discard the new rule

    Rule Pruning: similar to post-pruning of decision trees. Reduced Error Pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and if the error improves, prune the conjunct

  • Summary of Direct MethodGrow a single rule

    Remove the instances covered by the rule

    Prune the rule (if necessary)

    Add rule to Current Rule Set

    Repeat

  • Direct Method: RIPPER: For a 2-class problem, choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class; the negative class becomes the default class. For a multi-class problem, order the classes according to increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class; then repeat with the next smallest class as the positive class

  • Direct Method: RIPPER: Growing a rule: start from an empty rule; add conjuncts as long as they improve FOIL's information gain; stop when the rule no longer covers negative examples; then prune the rule immediately using incremental reduced error pruning. Measure for pruning: v = (p − n) / (p + n), where p is the number of positive examples and n the number of negative examples covered by the rule in the validation set. Pruning method: delete any final sequence of conditions that maximizes v

  • Direct Method: RIPPER: Building a Rule Set: use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and negative examples covered by the rule. Each time a rule is added to the rule set, compute the new description length; stop adding rules when the new description length is d bits longer than the smallest description length obtained so far

  • Direct Method: RIPPER: Optimize the rule set: for each rule r in the rule set R, consider 2 alternative rules:

    Replacement rule (r*): grow a new rule from scratch. Revised rule (r′): add conjuncts to extend the rule r. Compare the rule set for r against the rule sets for r* and r′, and choose the rule set that minimizes the description length (MDL principle). Repeat rule generation and rule optimization for the remaining positive examples

  • Indirect Methods

    Rule Set (read off a decision tree that splits on P, then Q and R):
    r1: (P=No, Q=No) ==> -
    r2: (P=No, Q=Yes) ==> +
    r3: (P=Yes, R=No) ==> +
    r4: (P=Yes, R=Yes, Q=No) ==> -
    r5: (P=Yes, R=Yes, Q=Yes) ==> +

  • Indirect Method: C4.5rules: Extract rules from an unpruned decision tree. For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate for r against all the r′; prune if one of the r′ has a lower pessimistic error rate. Repeat until we can no longer improve the generalization error

  • Indirect Method: C4.5rules: Instead of ordering the rules, order subsets of rules (class ordering). Each subset is a collection of rules with the same rule consequent (class). Compute the description length of each subset: Description length = L(error) + g · L(model), where g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)

  • Example

    Name          | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    --------------|------------|----------|---------|---------------|-----------|-----------
    human         | yes        | no       | no      | no            | yes       | mammals
    python        | no         | yes      | no      | no            | no        | reptiles
    salmon        | no         | yes      | no      | yes           | no        | fishes
    whale         | yes        | no       | no      | yes           | no        | mammals
    frog          | no         | yes      | no      | sometimes     | yes       | amphibians
    komodo        | no         | yes      | no      | no            | yes       | reptiles
    bat           | yes        | no       | yes     | no            | yes       | mammals
    pigeon        | no         | yes      | yes     | no            | yes       | birds
    cat           | yes        | no       | no      | no            | yes       | mammals
    leopard shark | yes        | no       | no      | yes           | no        | fishes
    turtle        | no         | yes      | no      | sometimes     | yes       | reptiles
    penguin       | no         | yes      | no      | sometimes     | yes       | birds
    porcupine     | yes        | no       | no      | no            | yes       | mammals
    eel           | no         | yes      | no      | yes           | no        | fishes
    salamander    | no         | yes      | no      | sometimes     | yes       | amphibians
    gila monster  | no         | yes      | no      | no            | yes       | reptiles
    platypus      | no         | yes      | no      | no            | yes       | mammals
    owl           | no         | yes      | yes     | no            | yes       | birds
    dolphin       | yes        | no       | no      | yes           | no        | mammals
    eagle         | no         | yes      | yes     | no            | yes       | birds

  • C4.5 versus C4.5rules versus RIPPER

    C4.5rules:
    (Give Birth=No, Can Fly=Yes) → Birds
    (Give Birth=No, Live in Water=Yes) → Fishes
    (Give Birth=Yes) → Mammals
    (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
    ( ) → Amphibians

    RIPPER:
    (Live in Water=Yes) → Fishes
    (Have Legs=No) → Reptiles
    (Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
    (Can Fly=Yes, Give Birth=No) → Birds
    ( ) → Mammals

  • C4.5 versus C4.5rules versus RIPPER

    C4.5 and C4.5rules (rows = ACTUAL CLASS, columns = PREDICTED CLASS):

    Actual \ Predicted | Amphibians | Fishes | Reptiles | Birds | Mammals
    -------------------|------------|--------|----------|-------|--------
    Amphibians         | 2          | 0      | 0        | 0     | 0
    Fishes             | 0          | 2      | 0        | 0     | 1
    Reptiles           | 1          | 0      | 3        | 0     | 0
    Birds              | 1          | 0      | 0        | 3     | 0
    Mammals            | 0          | 0      | 1        | 0     | 6

    RIPPER:

    Actual \ Predicted | Amphibians | Fishes | Reptiles | Birds | Mammals
    -------------------|------------|--------|----------|-------|--------
    Amphibians         | 0          | 0      | 0        | 0     | 2
    Fishes             | 0          | 3      | 0        | 0     | 0
    Reptiles           | 0          | 0      | 3        | 0     | 1
    Birds              | 0          | 0      | 1        | 2     | 1
    Mammals            | 0          | 2      | 1        | 0     | 4

  • Advantages of Rule-Based Classifiers: As highly expressive as decision trees; easy to interpret; easy to generate; can classify new instances rapidly; performance comparable to decision trees

  • Instance-Based Classifiers: Store the training records; use the training records to predict the class label of unseen cases

  • Instance Based Classifiers: Examples: Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly

    Nearest neighbor: uses the k closest points (nearest neighbors) to perform classification

  • Nearest Neighbor Classifiers: Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck

  • Nearest-Neighbor Classifiers: Requires three things: the set of stored records; a distance metric to compute the distance between records; and the value of k, the number of nearest neighbors to retrieve

    To classify an unknown record: compute the distance to the training records, identify the k nearest neighbors, and use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

  • Definition of Nearest Neighbor: The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

  • 1-Nearest Neighbor: Voronoi Diagram

  • Nearest Neighbor Classification: Compute the distance between two points, e.g., the Euclidean distance d(p, q) = sqrt( Σi (pi − qi)² )

    Determine the class from the nearest neighbor list: take the majority vote of the class labels among the k nearest neighbors, optionally weighing each vote according to distance, e.g., with weight factor w = 1/d²
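
    A compact k-NN sketch along those lines (plain Python; the optional distance-weighted vote uses w = 1/d² as above, and the data set is illustrative):

    from collections import defaultdict
    from math import dist  # Euclidean distance (Python 3.8+)

    def knn_predict(train, x, k=3, weighted=False):
        """train: list of (point, label); x: query point."""
        neighbors = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
        votes = defaultdict(float)
        for point, label in neighbors:
            votes[label] += 1.0 / (dist(point, x) ** 2 + 1e-12) if weighted else 1.0
        return max(votes, key=votes.get)

    train = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-")]
    print(knn_predict(train, (2, 2), k=3))  # '+'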

  • Nearest Neighbor Classification: Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes

  • Nearest Neighbor Classification: Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example: the height of a person may vary from 1.5m to 1.8m; the weight of a person may vary from 90lb to 300lb; the income of a person may vary from $10K to $1M

  • Nearest Neighbor Classification: Problem with the Euclidean measure: for high dimensional data (the curse of dimensionality) it can produce counter-intuitive results

    Example: 111111111110 vs 011111111111 gives d = 1.4142, and 100000000000 vs 000000000001 also gives d = 1.4142, even though the first pair of vectors agree in far more positions. Solution: normalize the vectors to unit length

  • Nearest Neighbor Classification: k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems, so classifying unknown records is relatively expensive

  • Example: PEBLS: PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg). Works with both continuous and nominal features. For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM). Each record is assigned a weight factor. Number of nearest neighbors: k = 1

  • Example: PEBLS: Distance between nominal attribute values: d(V1, V2) = Σi | n1i/n1 − n2i/n2 |, where nji is the number of class-i records with attribute value Vj. From the tables below:

    d(Single, Married)   = | 2/4 − 0/4 | + | 2/4 − 4/4 | = 1
    d(Single, Divorced)  = | 2/4 − 1/2 | + | 2/4 − 1/2 | = 0
    d(Married, Divorced) = | 0/4 − 1/2 | + | 4/4 − 1/2 | = 1
    d(Refund=Yes, Refund=No) = | 0/3 − 3/7 | + | 3/3 − 4/7 | = 6/7

    Marital Status:
    Class | Single | Married | Divorced
    Yes   | 2      | 0       | 1
    No    | 2      | 4       | 1

    Refund:
    Class | Yes | No
    Yes   | 0   | 3
    No    | 3   | 4
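
    A sketch of that computation (plain Python; the per-class counts come straight from the tables above, and the function name is illustrative):

    def mvdm(counts_v1, counts_v2):
        # counts are per-class tallies for an attribute value,
        # e.g. Single -> [2, 2] means 2 Yes records and 2 No records
        n1, n2 = sum(counts_v1), sum(counts_v2)
        return sum(abs(c1 / n1 - c2 / n2) for c1, c2 in zip(counts_v1, counts_v2))

    single, married, divorced = [2, 2], [0, 4], [1, 1]  # (Yes, No) counts
    print(mvdm(single, married))    # 1.0
    print(mvdm(single, divorced))   # 0.0
    print(mvdm(married, divorced))  # 1.0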

    Training records behind the counts above:

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Example: PEBLS: Distance between record X and record Y: Δ(X, Y) = wX · wY · Σi d(Xi, Yi)², where wX ≈ 1 if X makes accurate predictions most of the time and wX > 1 if X is not reliable for making predictions

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    X   | Yes    | Single         | 125K           | No
    Y   | No     | Married        | 100K           | No

  • Bayes Classifier: A probabilistic framework for solving classification problems. Conditional probability: P(C|A) = P(A, C) / P(A) and P(A|C) = P(A, C) / P(C)

    Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)

  • Example of Bayes Theorem: Given: a doctor knows that meningitis causes stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having stiff neck is 1/20

    If a patient has stiff neck, what's the probability he/she has meningitis?
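
    Working it out with Bayes theorem and the numbers above:

    P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002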

  • Bayesian Classifiers: Consider each attribute and class label as random variables

    Given a record with attributes (A1, A2, …, An), the goal is to predict class C; specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

    Can we estimate P(C | A1, A2, …, An) directly from data?

  • Bayesian Classifiers: Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem: P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

    Choose the value of C that maximizes P(C | A1, A2, …, An)

    This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

    How do we estimate P(A1, A2, …, An | C)?

  • Naïve Bayes Classifier: Assume independence among the attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) ⋯ P(An | Cj)

    We can then estimate P(Ai | Cj) for all Ai and Cj.

    A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal.

  • How to Estimate Probabilities from Data? Class prior: P(Cj) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10

    For discrete attributes: P(Ai | Ck) = |Aik| / Nc

    where |Aik| is the number of instances having attribute value Ai and belonging to class Ck. Examples:

    P(Status=Married | No) = 4/7; P(Refund=Yes | Yes) = 0

  • How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins: one ordinal attribute per bin (violates the independence assumption). Two-way split: (A < v) or (A > v): choose only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution; use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, use it to estimate the conditional probability P(Ai | c)

  • How to Estimate Probabilities from Data? Normal distribution: P(Ai | cj) = (1 / sqrt(2π σij²)) · exp( −(Ai − μij)² / (2σij²) )

    One for each (Ai, cj) pair

    For (Income, Class=No): if Class=No, the sample mean = 110 and sample variance = 2975, so P(Income=120 | No) = (1 / sqrt(2π · 2975)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

  • Example of Naïve Bayes Classifier: Given a test record X = (Refund=No, Married, Income=120K):

    P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024

    P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × 1.2×10⁻⁹ = 0

    Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No

  • Naïve Bayes Classifier: If one of the conditional probabilities is zero, then the entire expression becomes zero. Probability estimation alternatives:

    Original:   P(Ai | C) = Nic / Nc
    Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
    m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

    c: number of classes; p: prior probability; m: parameter
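
    A minimal naïve Bayes sketch with a Laplace-style correction for discrete attributes (names are illustrative; the smoothing here adds 1 to each count and the number of distinct values of the attribute to the denominator, a common variant of the correction above, and continuous attributes are ignored):

    from collections import Counter, defaultdict
    from math import prod

    def train_nb(records):
        """records: list of (attribute-tuple, class)."""
        class_counts = Counter(c for _, c in records)
        attr_counts = defaultdict(Counter)  # (attr index, class) -> value counts
        for attrs, c in records:
            for i, v in enumerate(attrs):
                attr_counts[(i, c)][v] += 1
        return class_counts, attr_counts

    def predict_nb(class_counts, attr_counts, x, n_values):
        """n_values[i]: number of distinct values of attribute i."""
        n = sum(class_counts.values())
        def score(c):
            nc = class_counts[c]
            return (nc / n) * prod(
                (attr_counts[(i, c)][v] + 1) / (nc + n_values[i])
                for i, v in enumerate(x))
        return max(class_counts, key=score)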

  • Example of Naïve Bayes Classifier: A: attributes; M: mammals; N: non-mammals. Since P(A|M) P(M) > P(A|N) P(N), the test record below is classified as a mammal => Mammals

    Training data relabeled into two classes for this example:

    Name          | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    --------------|------------|----------|---------|---------------|-----------|------------
    human         | yes        | no       | no      | no            | yes       | mammals
    python        | no         | yes      | no      | no            | no        | non-mammals
    salmon        | no         | yes      | no      | yes           | no        | non-mammals
    whale         | yes        | no       | no      | yes           | no        | mammals
    frog          | no         | yes      | no      | sometimes     | yes       | non-mammals
    komodo        | no         | yes      | no      | no            | yes       | non-mammals
    bat           | yes        | no       | yes     | no            | yes       | mammals
    pigeon        | no         | yes      | yes     | no            | yes       | non-mammals
    cat           | yes        | no       | no      | no            | yes       | mammals
    leopard shark | yes        | no       | no      | yes           | no        | non-mammals
    turtle        | no         | yes      | no      | sometimes     | yes       | non-mammals
    penguin       | no         | yes      | no      | sometimes     | yes       | non-mammals
    porcupine     | yes        | no       | no      | no            | yes       | mammals
    eel           | no         | yes      | no      | yes           | no        | non-mammals
    salamander    | no         | yes      | no      | sometimes     | yes       | non-mammals
    gila monster  | no         | yes      | no      | no            | yes       | non-mammals
    platypus      | no         | yes      | no      | no            | yes       | mammals
    owl           | no         | yes      | yes     | no            | yes       | non-mammals
    dolphin       | yes        | no       | no      | yes           | no        | mammals
    eagle         | no         | yes      | yes     | no            | yes       | non-mammals

    Test record:

    Name  | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    ------|------------|----------|---------|---------------|-----------|------
    human | yes        | no       | no      | yes           | no        | ?

  • Naïve Bayes (Summary): Robust to isolated noise points

    Handle missing values by ignoring the instance during probability estimate calculations

    Robust to irrelevant attributes

    The independence assumption may not hold for some attributes; use other techniques such as Bayesian Belief Networks (BBN)

  • Artificial Neural Networks (ANN): Output Y is 1 if at least two of the three inputs are equal to 1.

    (Figure: a black box with input nodes X1, X2, X3 and output Y, next to the truth table of this "at least two inputs equal 1" function.)

  • Artificial Neural Networks (ANN)

    (Figure: the same black box as a perceptron: input nodes X1, X2, X3 with weights 0.3 each feeding a summing output node with threshold t = 0.4, i.e., Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0).)

  • Artificial Neural Networks (ANN): The model is an assembly of inter-connected nodes and weighted links

    The output node sums up each of its input values according to the weights of its links

    The sum is compared against some threshold t

    Perceptron model: Y = I( Σi wi·Xi − t > 0 ), or equivalently Y = sign( Σi wi·Xi − t )

    (Figure: perceptron with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y; a training sketch follows.)
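
    A sketch of perceptron learning for the "at least two of three inputs" function above (the update is the classic perceptron rule; the learning rate and epoch count are assumptions, not from the slides):

    from itertools import product

    # Target: Y = 1 iff at least two of three binary inputs are 1
    data = [(x, int(sum(x) >= 2)) for x in product([0, 1], repeat=3)]

    w, t, lr = [0.0, 0.0, 0.0], 0.0, 0.1   # weights, threshold, learning rate
    for _ in range(100):
        for x, y in data:
            y_hat = int(sum(wi * xi for wi, xi in zip(w, x)) - t > 0)
            err = y - y_hat
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # nudge weights
            t -= lr * err                                     # threshold moves opposite

    print(w, t)  # converges to a weight/threshold pair realizing the function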

  • General Structure of ANN: Training an ANN means learning the weights of the neurons

    (Figure: neuron i computes the weighted sum Si = Σj wij·Ij of its inputs I1, I2, I3 and emits the output Oi = g(Si), where g is the activation function and t a threshold; a full network stacks an input layer (x1 … x5), a hidden layer, and an output layer producing y.)

  • Algorithm for learning ANN: Initialize the weights (w0, w1, …, wk)

    Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples. Objective function: E = Σi [ Yi − f(wi, Xi) ]²

    Find the weights wi that minimize the objective function, e.g., with the backpropagation algorithm (see lecture notes)

  • Support Vector Machines: Find a linear hyperplane (decision boundary) that will separate the data

  • Support Vector Machines: One possible solution, hyperplane B1

  • Support Vector Machines: Another possible solution, hyperplane B2

  • Support Vector Machines: Other possible solutions

  • Support Vector Machines: Which one is better, B1 or B2? How do you define better?

  • Support Vector Machines: Find the hyperplane that maximizes the margin => B1 is better than B2

    (Figure: B1 with margin hyperplanes b11 and b12; B2 with the narrower pair b21 and b22.)

  • Support Vector Machines: The decision boundary B1 is w·x + b = 0, with margin hyperplanes b11: w·x + b = 1 and b12: w·x + b = −1; the decision function is f(x) = 1 if w·x + b ≥ 1 and f(x) = −1 if w·x + b ≤ −1, and the margin is 2 / ‖w‖²

  • Support Vector Machines: We want to maximize: Margin = 2 / ‖w‖²

    This is equivalent to minimizing: L(w) = ‖w‖² / 2

    But it is subject to the following constraints: yi (w·xi + b) ≥ 1 for every training point (xi, yi)

    This is a constrained optimization problem

    Numerical approaches exist to solve it (e.g., quadratic programming)
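
    In practice the quadratic program is rarely coded by hand; a sketch using scikit-learn's linear SVM (a very large C approximates the hard-margin problem above; the toy data is illustrative):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 1], [1, 2],    # class -1
                  [4, 4], [5, 4], [4, 5]])   # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)  # large C ~ hard margin
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)   # the learned w and b of w·x + b = 0
    print(clf.support_vectors_)        # training points on the margin hyperplanes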

  • Support Vector Machines: What if the problem is not linearly separable?

  • Support Vector Machines: What if the problem is not linearly separable? Introduce slack variables ξi and minimize: L(w) = ‖w‖²/2 + C Σi ξi

    Subject to: yi (w·xi + b) ≥ 1 − ξi, with ξi ≥ 0

  • Nonlinear Support Vector Machines: What if the decision boundary is not linear?

  • Nonlinear Support Vector Machines: Transform the data into a higher dimensional space

  • Ensemble Methods: Construct a set of classifiers from the training data

    Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

  • General Idea

    (Figure: Step 1: create multiple data sets D1, D2, …, Dt−1, Dt from the original training data D; Step 2: build multiple classifiers C1, C2, …, Ct−1, Ct, one per data set; Step 3: combine the classifiers into a single classifier C*.)

  • Why does it work? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent. The ensemble errs only when a majority of the base classifiers err, so the probability that the ensemble classifier makes a wrong prediction is Σ(i=13..25) C(25, i) εⁱ (1 − ε)²⁵⁻ⁱ ≈ 0.06
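
    Checking that number directly in Python:

    from math import comb

    eps = 0.35
    # the ensemble errs when 13 or more of the 25 base classifiers err
    p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
    print(round(p_wrong, 2))  # 0.06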

  • Examples of Ensemble Methods: How to generate an ensemble of classifiers? Bagging

    Boosting

  • Bagging: Sampling with replacement

    Build a classifier on each bootstrap sample

    Each record has probability 1 − (1 − 1/n)ⁿ ≈ 0.632 (for large n) of being selected at least once in a bootstrap sample of size n
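
    A bagging sketch (illustrative: any base learner exposing a train function that returns a predict function would do):

    import random
    from collections import Counter

    def bagging(records, train_fn, n_models=25):
        """records: list of (x, y); train_fn(sample) -> predict function."""
        n = len(records)
        models = [train_fn([random.choice(records) for _ in range(n)])  # bootstrap sample
                  for _ in range(n_models)]
        def predict(x):
            return Counter(m(x) for m in models).most_common(1)[0][0]  # majority vote
        return predict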

  • Boosting: An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records. Initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round

  • Boosting: Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased

    Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds

  • Example: AdaBoost: Base classifiers: C1, C2, …, CT

    Error rate: εi = (1/N) Σj wj δ( Ci(xj) ≠ yj )

    Importance of a classifier: αi = ½ ln( (1 − εi) / εi )

  • Example: AdaBoost: Weight update: wj(i+1) = ( wj(i) / Zi ) × exp(−αi) if Ci(xj) = yj, and ( wj(i) / Zi ) × exp(+αi) if Ci(xj) ≠ yj, where Zi is a normalization factor

    If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated. Classification: C*(x) = argmax over y of Σi αi δ( Ci(x) = y )
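
    Those update rules in a compact sketch (the pool of candidate base classifiers is left abstract; everything else follows the formulas above, except that rounds with error ≥ 50% simply stop instead of resampling):

    from math import log, exp

    def adaboost(points, labels, candidates, T=10):
        """points/labels: training data with labels ±1; candidates: functions x -> ±1."""
        N = len(points)
        w = [1.0 / N] * N
        ensemble = []
        for _ in range(T):
            err = lambda h: sum(wi for wi, x, y in zip(w, points, labels) if h(x) != y)
            h = min(candidates, key=err)   # best base classifier under current weights
            eps = err(h)
            if eps == 0 or eps >= 0.5:
                break
            alpha = 0.5 * log((1 - eps) / eps)
            ensemble.append((alpha, h))
            # increase weights of wrongly classified records, decrease the rest
            w = [wi * exp(alpha if h(x) != y else -alpha)
                 for wi, x, y in zip(w, points, labels)]
            Z = sum(w)
            w = [wi / Z for wi in w]       # normalize so the weights sum to 1
        return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1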

  • Illustrating AdaBoost

    (Figure: three boosting rounds B1, B2, B3 on a one-dimensional set of + and − points, with α = 1.9459, 2.9323, and 3.8744; the weights of misclassified points grow, e.g., from 0.0094 to 0.4623 after round 1, so later rounds focus on the hard points, and the combined ensemble classifies all points correctly.)