
  • Data Mining Classification: Alternative Techniques

    Lecture Notes for Chapter 5

    Introduction to Data Mining by Tan, Steinbach, Kumar

  • Rule-Based Classifier: Classify records by using a collection of if…then rules

    Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label. LHS: rule antecedent or condition. RHS: rule consequent. Examples of classification rules: (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds; (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

  • Rule-based Classifier (Example)

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

  • Application of Rule-Based Classifier: A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    The rule R1 covers a hawk => Bird
    The rule R3 covers the grizzly bear => Mammal

  • Rule Coverage and Accuracy: Coverage of a rule: the fraction of records that satisfy the antecedent of the rule. Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent

    (Status=Single) → No; Coverage = 40%, Accuracy = 50%. From the table below, 4 of the 10 records are Single, and 2 of those 4 have class No; see also the Python sketch after the table

    Tid | Refund | Marital Status | Taxable Income | Class
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes
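
    To make the coverage and accuracy numbers concrete, here is a minimal Python sketch that evaluates the rule (Status=Single) → No against this table (the record list and function name are illustrative, not from the slides):

    # Evaluate rule coverage and accuracy on the table above.
    records = [
        ("Yes", "Single", "125K", "No"),   ("No", "Married", "100K", "No"),
        ("No", "Single", "70K", "No"),     ("Yes", "Married", "120K", "No"),
        ("No", "Divorced", "95K", "Yes"),  ("No", "Married", "60K", "No"),
        ("Yes", "Divorced", "220K", "No"), ("No", "Single", "85K", "Yes"),
        ("No", "Married", "75K", "No"),    ("No", "Single", "90K", "Yes"),
    ]

    def rule_stats(records, antecedent, consequent):
        covered = [r for r in records if antecedent(r)]        # records satisfying the antecedent
        correct = [r for r in covered if r[-1] == consequent]  # covered records with the predicted class
        return len(covered) / len(records), len(correct) / len(covered)

    # Rule: (Status=Single) -> No
    print(rule_stats(records, lambda r: r[1] == "Single", "No"))  # (0.4, 0.5)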

  • How does Rule-based Classifier Work?

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    A lemur triggers rule R3, so it is classified as a mammal. A turtle triggers both R4 and R5. A dogfish shark triggers none of the rules.

  • Characteristics of Rule-Based Classifier: Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other; every record is covered by at most one rule

    Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values; each record is covered by at least one rule

  • From Decision Trees To Rules: The rules are mutually exclusive and exhaustive, and the rule set contains as much information as the tree

  • Rules Can Be Simplified: Initial rule: (Refund=No) ∧ (Status=Married) → No. Simplified rule: (Status=Married) → No

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Effect of Rule Simplification: Rules are no longer mutuallyexclusive; a record may trigger more than one rule. Solution? Use an ordered rule set, or an unordered rule set with voting schemes

    Rules are no longer exhaustive; a record may not trigger any rule. Solution? Use a default class

  • Ordered Rule Set: Rules are rank ordered according to their priority; an ordered rule set is known as a decision list. When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it has triggered; if none of the rules fire, it is assigned the default class

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

  • Rule Ordering Schemes: Rule-based ordering: individual rules are ranked based on their quality. Class-based ordering: rules that belong to the same class appear together

    Rule-based Ordering

    (Refund=Yes) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

    (Refund=No, Marital Status={Married}) ==> No

    Class-based Ordering

    (Refund=Yes) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No

    (Refund=No, Marital Status={Married}) ==> No

    (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

  • Building Classification Rules: Direct Method: extract rules directly from data, e.g., RIPPER, CN2, Holte's 1R

    Indirect Method: extract rules from other classification models (e.g., decision trees, neural networks, etc.), e.g., C4.5rules

  • Direct Method: Sequential Covering: (1) Start from an empty rule; (2) grow a rule using the Learn-One-Rule function; (3) remove the training records covered by the rule; (4) repeat steps (2) and (3) until the stopping criterion is met
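
    As a rough illustration of this loop (a sketch, not the book's exact algorithm: learn_one_rule here is a deliberately simple greedy stand-in that picks the single best attribute test for the target class):

    # Sketch of sequential covering; records are (attribute-dict, class-label) pairs.
    def learn_one_rule(records, target_class):
        best, best_acc = None, -1.0
        for a in records[0][0]:                       # each attribute
            for v in {r[0][a] for r in records}:      # each observed value
                covered = [r for r in records if r[0][a] == v]
                acc = sum(r[1] == target_class for r in covered) / len(covered)
                if acc > best_acc:
                    best, best_acc = (a, v), acc
        return best

    def sequential_covering(records, target_class):
        rules, remaining = [], list(records)
        while any(r[1] == target_class for r in remaining):
            a, v = learn_one_rule(remaining, target_class)      # step (2): grow a rule
            rules.append(((a, v), target_class))
            remaining = [r for r in remaining if r[0][a] != v]  # step (3): remove covered records
        return rules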

  • Example of Sequential Covering

    (Figures: successive steps of sequential covering on a 2-D data set — find the best rule R1, remove the training records it covers, then find the next rule R2.)
  • Aspects of Sequential CoveringRule Growing

    Instance Elimination

    Rule Evaluation

    Stopping Criterion

    Rule Pruning

  • Rule Growing: Two common strategies

    (a) General-to-specific: start from the empty rule { } and repeatedly add the conjunct that most improves rule quality (candidates in the figure include Refund=No, Status=Single, Status=Divorced, Status=Married, Income>80K, …).

    (b) Specific-to-general: start from rules built around specific positive instances, e.g., Refund=No, Status=Single, Income=85K (Class=Yes) and Refund=No, Status=Single, Income=90K (Class=Yes), and generalize them by dropping conjuncts, here to Refund=No, Status=Single (Class=Yes).

  • Rule Growing (Examples): CN2 Algorithm: start from an empty conjunct {}; add the conjuncts that minimize the entropy measure: {A}, {A,B}, …; determine the rule consequent by taking the majority class of the instances covered by the rule. RIPPER Algorithm: start from an empty rule: {} => class; add the conjuncts that maximize FOIL's information gain measure: R0: {} => class (initial rule); R1: {A} => class (rule after adding a conjunct); Gain(R0, R1) = t × [ log2( p1/(p1+n1) ) − log2( p0/(p0+n0) ) ], where t is the number of positive instances covered by both R0 and R1,

    p0: number of positive instances covered by R0; n0: number of negative instances covered by R0; p1: number of positive instances covered by R1; n1: number of negative instances covered by R1
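
    That gain formula transcribes directly into Python (a small helper sketch; the argument names follow the definitions above, and for a refinement R1 of R0 every instance covered by R1 is also covered by R0, so t = p1):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        # p0, n0: positives/negatives covered by R0; p1, n1: by R1
        t = p1
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # Example: R0 covers 10 positives and 10 negatives; after adding a
    # conjunct, R1 covers 8 positives and 2 negatives.
    print(foil_gain(10, 10, 8, 2))  # ≈ 5.42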

  • Instance Elimination: Why do we need to eliminate instances? Otherwise, the next rule is identical to the previous rule. Why do we remove positive instances? To ensure that the next rule is different. Why do we remove negative instances? To prevent underestimating the accuracy of the rule. Compare rules R2 and R3 in the diagram.

    (Figure: a scatter of + and − training instances with the regions covered by rules R1, R2, and R3.)

  • Rule Evaluation: Metrics: Accuracy = nc / n

    Laplace = (nc + 1) / (n + k)

    M-estimate = (nc + k·p) / (n + k)

    n: number of instances covered by the rule; nc: number of covered instances that belong to the class predicted by the rule; k: number of classes; p: prior probability of that class
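
    The three metrics in Python, using the symbols just defined:

    def accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        # smooths accuracy toward 1/k for rules covering few instances
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        # smooths accuracy toward the prior p of the rule's class
        return (nc + k * p) / (n + k)

    # A rule covering n=5 instances, nc=4 of its class, k=2 classes, prior p=0.5:
    print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))
    # 0.8  ~0.714  ~0.714  (the last two coincide here because k*p = 1)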

  • Stopping Criterion and Rule Pruning: Stopping criterion: compute the gain; if the gain is not significant, discard the new rule

    Rule Pruning: similar to post-pruning of decision trees. Reduced Error Pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and if the error improves, prune the conjunct

  • Summary of Direct MethodGrow a single rule

    Remove the instances covered by the rule

    Prune the rule (if necessary)

    Add rule to Current Rule Set

    Repeat

  • Direct Method: RIPPER: For a 2-class problem, choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class; the negative class becomes the default class. For a multi-class problem, order the classes according to increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class; then repeat with the next smallest class as the positive class

  • Direct Method: RIPPER: Growing a rule: start from an empty rule; add conjuncts as long as they improve FOIL's information gain; stop when the rule no longer covers negative examples; then prune the rule immediately using incremental reduced error pruning. Measure for pruning: v = (p − n) / (p + n), where p is the number of positive examples and n the number of negative examples covered by the rule in the validation set. Pruning method: delete any final sequence of conditions that maximizes v

  • Direct Method: RIPPER: Building a Rule Set: use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and negative examples covered by the rule. Each time a rule is added to the rule set, compute the new description length; stop adding rules when the new description length is d bits longer than the smallest description length obtained so far

  • Direct Method: RIPPER: Optimize the rule set: for each rule r in the rule set R, consider 2 alternative rules:

    Replacement rule (r*): grow a new rule from scratch. Revised rule (r′): add conjuncts to extend the rule r. Compare the rule set for r against the rule sets for r* and r′, and choose the rule set that minimizes the description length (MDL principle). Repeat rule generation and rule optimization for the remaining positive examples

  • Indirect Methods

    Rule Set (read off a decision tree that splits on P, then Q and R):
    r1: (P=No, Q=No) ==> -
    r2: (P=No, Q=Yes) ==> +
    r3: (P=Yes, R=No) ==> +
    r4: (P=Yes, R=Yes, Q=No) ==> -
    r5: (P=Yes, R=Yes, Q=Yes) ==> +

  • Indirect Method: C4.5rules: Extract rules from an unpruned decision tree. For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate for r against all the r′; prune if one of the r′ has a lower pessimistic error rate. Repeat until we can no longer improve the generalization error

  • Indirect Method: C4.5rules: Instead of ordering the rules, order subsets of rules (class ordering). Each subset is a collection of rules with the same rule consequent (class). Compute the description length of each subset: Description length = L(error) + g · L(model), where g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)

  • Example

    Name          | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    --------------|------------|----------|---------|---------------|-----------|-----------
    human         | yes        | no       | no      | no            | yes       | mammals
    python        | no         | yes      | no      | no            | no        | reptiles
    salmon        | no         | yes      | no      | yes           | no        | fishes
    whale         | yes        | no       | no      | yes           | no        | mammals
    frog          | no         | yes      | no      | sometimes     | yes       | amphibians
    komodo        | no         | yes      | no      | no            | yes       | reptiles
    bat           | yes        | no       | yes     | no            | yes       | mammals
    pigeon        | no         | yes      | yes     | no            | yes       | birds
    cat           | yes        | no       | no      | no            | yes       | mammals
    leopard shark | yes        | no       | no      | yes           | no        | fishes
    turtle        | no         | yes      | no      | sometimes     | yes       | reptiles
    penguin       | no         | yes      | no      | sometimes     | yes       | birds
    porcupine     | yes        | no       | no      | no            | yes       | mammals
    eel           | no         | yes      | no      | yes           | no        | fishes
    salamander    | no         | yes      | no      | sometimes     | yes       | amphibians
    gila monster  | no         | yes      | no      | no            | yes       | reptiles
    platypus      | no         | yes      | no      | no            | yes       | mammals
    owl           | no         | yes      | yes     | no            | yes       | birds
    dolphin       | yes        | no       | no      | yes           | no        | mammals
    eagle         | no         | yes      | yes     | no            | yes       | birds

  • C4.5 versus C4.5rules versus RIPPER

    C4.5rules:
    (Give Birth=No, Can Fly=Yes) → Birds
    (Give Birth=No, Live in Water=Yes) → Fishes
    (Give Birth=Yes) → Mammals
    (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
    ( ) → Amphibians

    RIPPER:
    (Live in Water=Yes) → Fishes
    (Have Legs=No) → Reptiles
    (Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
    (Can Fly=Yes, Give Birth=No) → Birds
    ( ) → Mammals

  • C4.5 versus C4.5rules versus RIPPER

    C4.5 and C4.5rules (rows = ACTUAL CLASS, columns = PREDICTED CLASS):

    Actual \ Predicted | Amphibians | Fishes | Reptiles | Birds | Mammals
    -------------------|------------|--------|----------|-------|--------
    Amphibians         | 2          | 0      | 0        | 0     | 0
    Fishes             | 0          | 2      | 0        | 0     | 1
    Reptiles           | 1          | 0      | 3        | 0     | 0
    Birds              | 1          | 0      | 0        | 3     | 0
    Mammals            | 0          | 0      | 1        | 0     | 6

    RIPPER:

    Actual \ Predicted | Amphibians | Fishes | Reptiles | Birds | Mammals
    -------------------|------------|--------|----------|-------|--------
    Amphibians         | 0          | 0      | 0        | 0     | 2
    Fishes             | 0          | 3      | 0        | 0     | 0
    Reptiles           | 0          | 0      | 3        | 0     | 1
    Birds              | 0          | 0      | 1        | 2     | 1
    Mammals            | 0          | 2      | 1        | 0     | 4

  • Advantages of Rule-Based Classifiers: As highly expressive as decision trees; easy to interpret; easy to generate; can classify new instances rapidly; performance comparable to decision trees

  • Instance-Based Classifiers: Store the training records; use the training records to predict the class label of unseen cases

  • Instance Based Classifiers: Examples: Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly

    Nearest neighbor: uses the k closest points (nearest neighbors) to perform classification

  • Nearest Neighbor Classifiers: Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck

  • Nearest-Neighbor Classifiers: Requires three things: the set of stored records; a distance metric to compute the distance between records; and the value of k, the number of nearest neighbors to retrieve

    To classify an unknown record: compute the distance to the training records, identify the k nearest neighbors, and use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

  • Definition of Nearest Neighbor: The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

  • 1-Nearest Neighbor: Voronoi Diagram

  • Nearest Neighbor Classification: Compute the distance between two points, e.g., the Euclidean distance d(p, q) = sqrt( Σi (pi − qi)² )

    Determine the class from the nearest neighbor list: take the majority vote of the class labels among the k nearest neighbors, optionally weighing each vote according to distance, e.g., with weight factor w = 1/d²
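
    A compact k-NN sketch along those lines (plain Python; the optional distance-weighted vote uses w = 1/d² as above, and the data set is illustrative):

    from collections import defaultdict
    from math import dist  # Euclidean distance (Python 3.8+)

    def knn_predict(train, x, k=3, weighted=False):
        """train: list of (point, label); x: query point."""
        neighbors = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
        votes = defaultdict(float)
        for point, label in neighbors:
            votes[label] += 1.0 / (dist(point, x) ** 2 + 1e-12) if weighted else 1.0
        return max(votes, key=votes.get)

    train = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-")]
    print(knn_predict(train, (2, 2), k=3))  # '+'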

  • Nearest Neighbor Classification: Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes

  • Nearest Neighbor Classification: Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example: the height of a person may vary from 1.5m to 1.8m; the weight of a person may vary from 90lb to 300lb; the income of a person may vary from $10K to $1M

  • Nearest Neighbor Classification: Problem with the Euclidean measure: for high dimensional data (the curse of dimensionality) it can produce counter-intuitive results

    Example: 111111111110 vs 011111111111 gives d = 1.4142, and 100000000000 vs 000000000001 also gives d = 1.4142, even though the first pair of vectors agree in far more positions. Solution: normalize the vectors to unit length

  • Nearest Neighbor Classification: k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems, so classifying unknown records is relatively expensive

  • Example: PEBLS: PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg). Works with both continuous and nominal features. For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM). Each record is assigned a weight factor. Number of nearest neighbors: k = 1

  • Example: PEBLS: Distance between nominal attribute values: d(V1, V2) = Σi | n1i/n1 − n2i/n2 |, where nji is the number of class-i records with attribute value Vj. From the tables below:

    d(Single, Married)   = | 2/4 − 0/4 | + | 2/4 − 4/4 | = 1
    d(Single, Divorced)  = | 2/4 − 1/2 | + | 2/4 − 1/2 | = 0
    d(Married, Divorced) = | 0/4 − 1/2 | + | 4/4 − 1/2 | = 1
    d(Refund=Yes, Refund=No) = | 0/3 − 3/7 | + | 3/3 − 4/7 | = 6/7

    Marital Status:
    Class | Single | Married | Divorced
    Yes   | 2      | 0       | 1
    No    | 2      | 4       | 1

    Refund:
    Class | Yes | No
    Yes   | 0   | 3
    No    | 3   | 4
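
    A sketch of that computation (plain Python; the per-class counts come straight from the tables above, and the function name is illustrative):

    def mvdm(counts_v1, counts_v2):
        # counts are per-class tallies for an attribute value,
        # e.g. Single -> [2, 2] means 2 Yes records and 2 No records
        n1, n2 = sum(counts_v1), sum(counts_v2)
        return sum(abs(c1 / n1 - c2 / n2) for c1, c2 in zip(counts_v1, counts_v2))

    single, married, divorced = [2, 2], [0, 4], [1, 1]  # (Yes, No) counts
    print(mvdm(single, married))    # 1.0
    print(mvdm(single, divorced))   # 0.0
    print(mvdm(married, divorced))  # 1.0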

    Training records behind the counts above:

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    1   | Yes    | Single         | 125K           | No
    2   | No     | Married        | 100K           | No
    3   | No     | Single         | 70K            | No
    4   | Yes    | Married        | 120K           | No
    5   | No     | Divorced       | 95K            | Yes
    6   | No     | Married        | 60K            | No
    7   | Yes    | Divorced       | 220K           | No
    8   | No     | Single         | 85K            | Yes
    9   | No     | Married        | 75K            | No
    10  | No     | Single         | 90K            | Yes

  • Example: PEBLS: Distance between record X and record Y: Δ(X, Y) = wX · wY · Σi d(Xi, Yi)², where wX ≈ 1 if X makes accurate predictions most of the time and wX > 1 if X is not reliable for making predictions

    Tid | Refund | Marital Status | Taxable Income | Cheat
    ----|--------|----------------|----------------|------
    X   | Yes    | Single         | 125K           | No
    Y   | No     | Married        | 100K           | No

  • Bayes Classifier: A probabilistic framework for solving classification problems. Conditional probability: P(C|A) = P(A, C) / P(A) and P(A|C) = P(A, C) / P(C)

    Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)

  • Example of Bayes Theorem: Given: a doctor knows that meningitis causes stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having stiff neck is 1/20

    If a patient has stiff neck, what's the probability he/she has meningitis?
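
    Working it out with Bayes theorem and the numbers above:

    P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002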

  • Bayesian Classifiers: Consider each attribute and class label as random variables

    Given a record with attributes (A1, A2, …, An), the goal is to predict class C; specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

    Can we estimate P(C | A1, A2, …, An) directly from data?

  • Bayesian Classifiers: Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem: P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

    Choose the value of C that maximizes P(C | A1, A2, …, An)

    This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

    How do we estimate P(A1, A2, …, An | C)?

  • Naïve Bayes Classifier: Assume independence among the attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) ⋯ P(An | Cj)

    We can then estimate P(Ai | Cj) for all Ai and Cj.

    A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal.

  • How to Estimate Probabilities from Data? Class prior: P(Cj) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10

    For discrete attributes: P(Ai | Ck) = |Aik| / Nc

    where |Aik| is the number of instances having attribute value Ai and belonging to class Ck. Examples:

    P(Status=Married | No) = 4/7; P(Refund=Yes | Yes) = 0

  • How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins: one ordinal attribute per bin (violates the independence assumption). Two-way split: (A < v) or (A > v): choose only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution; use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, use it to estimate the conditional probability P(Ai | c)

  • How to Estimate Probabilities from Data? Normal distribution: P(Ai | cj) = (1 / sqrt(2π σij²)) · exp( −(Ai − μij)² / (2σij²) )

    One for each (Ai, cj) pair

    For (Income, Class=No): if Class=No, the sample mean = 110 and sample variance = 2975, so P(Income=120 | No) = (1 / sqrt(2π · 2975)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

  • Example of Naïve Bayes Classifier: Given a test record X = (Refund=No, Married, Income=120K):

    P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024

    P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes) = 1 × 0 × 1.2×10⁻⁹ = 0

    Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No

  • Naïve Bayes Classifier: If one of the conditional probabilities is zero, then the entire expression becomes zero. Probability estimation alternatives:

    Original:   P(Ai | C) = Nic / Nc
    Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
    m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

    c: number of classes; p: prior probability; m: parameter
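
    A minimal naïve Bayes sketch with a Laplace-style correction for discrete attributes (names are illustrative; the smoothing here adds 1 to each count and the number of distinct values of the attribute to the denominator, a common variant of the correction above, and continuous attributes are ignored):

    from collections import Counter, defaultdict
    from math import prod

    def train_nb(records):
        """records: list of (attribute-tuple, class)."""
        class_counts = Counter(c for _, c in records)
        attr_counts = defaultdict(Counter)  # (attr index, class) -> value counts
        for attrs, c in records:
            for i, v in enumerate(attrs):
                attr_counts[(i, c)][v] += 1
        return class_counts, attr_counts

    def predict_nb(class_counts, attr_counts, x, n_values):
        """n_values[i]: number of distinct values of attribute i."""
        n = sum(class_counts.values())
        def score(c):
            nc = class_counts[c]
            return (nc / n) * prod(
                (attr_counts[(i, c)][v] + 1) / (nc + n_values[i])
                for i, v in enumerate(x))
        return max(class_counts, key=score)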

  • Example of Naïve Bayes Classifier: A: attributes; M: mammals; N: non-mammals. Since P(A|M) P(M) > P(A|N) P(N), the test record below is classified as a mammal => Mammals

    Training data relabeled into two classes for this example:

    Name          | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    --------------|------------|----------|---------|---------------|-----------|------------
    human         | yes        | no       | no      | no            | yes       | mammals
    python        | no         | yes      | no      | no            | no        | non-mammals
    salmon        | no         | yes      | no      | yes           | no        | non-mammals
    whale         | yes        | no       | no      | yes           | no        | mammals
    frog          | no         | yes      | no      | sometimes     | yes       | non-mammals
    komodo        | no         | yes      | no      | no            | yes       | non-mammals
    bat           | yes        | no       | yes     | no            | yes       | mammals
    pigeon        | no         | yes      | yes     | no            | yes       | non-mammals
    cat           | yes        | no       | no      | no            | yes       | mammals
    leopard shark | yes        | no       | no      | yes           | no        | non-mammals
    turtle        | no         | yes      | no      | sometimes     | yes       | non-mammals
    penguin       | no         | yes      | no      | sometimes     | yes       | non-mammals
    porcupine     | yes        | no       | no      | no            | yes       | mammals
    eel           | no         | yes      | no      | yes           | no        | non-mammals
    salamander    | no         | yes      | no      | sometimes     | yes       | non-mammals
    gila monster  | no         | yes      | no      | no            | yes       | non-mammals
    platypus      | no         | yes      | no      | no            | yes       | mammals
    owl           | no         | yes      | yes     | no            | yes       | non-mammals
    dolphin       | yes        | no       | no      | yes           | no        | mammals
    eagle         | no         | yes      | yes     | no            | yes       | non-mammals

    Test record:

    Name  | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    ------|------------|----------|---------|---------------|-----------|------
    human | yes        | no       | no      | yes           | no        | ?

  • Naïve Bayes (Summary): Robust to isolated noise points

    Handle missing values by ignoring the instance during probability estimate calculations

    Robust to irrelevant attributes

    The independence assumption may not hold for some attributes; use other techniques such as Bayesian Belief Networks (BBN)

  • Artificial Neural Networks (ANN): Output Y is 1 if at least two of the three inputs are equal to 1.

    (Figure: a black box with input nodes X1, X2, X3 and output Y, next to the truth table of this "at least two inputs equal 1" function.)

  • Artificial Neural Networks (ANN)

    (Figure: the same black box as a perceptron: input nodes X1, X2, X3 with weights 0.3 each feeding a summing output node with threshold t = 0.4, i.e., Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0).)

  • Artificial Neural Networks (ANN): The model is an assembly of inter-connected nodes and weighted links

    The output node sums up each of its input values according to the weights of its links

    The sum is compared against some threshold t

    Perceptron model: Y = I( Σi wi·Xi − t > 0 ), or equivalently Y = sign( Σi wi·Xi − t )

    (Figure: perceptron with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y; a training sketch follows.)
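
    A sketch of perceptron learning for the "at least two of three inputs" function above (the update is the classic perceptron rule; the learning rate and epoch count are assumptions, not from the slides):

    from itertools import product

    # Target: Y = 1 iff at least two of three binary inputs are 1
    data = [(x, int(sum(x) >= 2)) for x in product([0, 1], repeat=3)]

    w, t, lr = [0.0, 0.0, 0.0], 0.0, 0.1   # weights, threshold, learning rate
    for _ in range(100):
        for x, y in data:
            y_hat = int(sum(wi * xi for wi, xi in zip(w, x)) - t > 0)
            err = y - y_hat
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # nudge weights
            t -= lr * err                                     # threshold moves opposite

    print(w, t)  # converges to a weight/threshold pair realizing the function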

  • General Structure of ANN: Training an ANN means learning the weights of the neurons

    (Figure: neuron i computes the weighted sum Si = Σj wij·Ij of its inputs I1, I2, I3 and emits the output Oi = g(Si), where g is the activation function and t a threshold; a full network stacks an input layer (x1 … x5), a hidden layer, and an output layer producing y.)

  • Algorithm for learning ANN: Initialize the weights (w0, w1, …, wk)

    Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples. Objective function: E = Σi [ Yi − f(wi, Xi) ]²

    Find the weights wi that minimize the objective function, e.g., with the backpropagation algorithm (see lecture notes)

  • Support Vector Machines: Find a linear hyperplane (decision boundary) that will separate the data

  • Support Vector Machines: One possible solution, hyperplane B1

  • Support Vector Machines: Another possible solution, hyperplane B2

  • Support Vector Machines: Other possible solutions

  • Support Vector Machines: Which one is better, B1 or B2? How do you define better?

  • Support Vector Machines: Find the hyperplane that maximizes the margin => B1 is better than B2

    (Figure: B1 with margin hyperplanes b11 and b12; B2 with the narrower pair b21 and b22.)

  • Support Vector Machines: The decision boundary B1 is w·x + b = 0, with margin hyperplanes b11: w·x + b = 1 and b12: w·x + b = −1; the decision function is f(x) = 1 if w·x + b ≥ 1 and f(x) = −1 if w·x + b ≤ −1, and the margin is 2 / ‖w‖²

  • Support Vector Machines: We want to maximize: Margin = 2 / ‖w‖²

    This is equivalent to minimizing: L(w) = ‖w‖² / 2

    But it is subject to the following constraints: yi (w·xi + b) ≥ 1 for every training point (xi, yi)

    This is a constrained optimization problem

    Numerical approaches exist to solve it (e.g., quadratic programming)
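
    In practice the quadratic program is rarely coded by hand; a sketch using scikit-learn's linear SVM (a very large C approximates the hard-margin problem above; the toy data is illustrative):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 1], [1, 2],    # class -1
                  [4, 4], [5, 4], [4, 5]])   # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)  # large C ~ hard margin
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)   # the learned w and b of w·x + b = 0
    print(clf.support_vectors_)        # training points on the margin hyperplanes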

  • Support Vector Machines: What if the problem is not linearly separable?

  • Support Vector Machines: What if the problem is not linearly separable? Introduce slack variables ξi and minimize: L(w) = ‖w‖²/2 + C Σi ξi

    Subject to: yi (w·xi + b) ≥ 1 − ξi, with ξi ≥ 0

  • Nonlinear Support Vector Machines: What if the decision boundary is not linear?

  • Nonlinear Support Vector Machines: Transform the data into a higher dimensional space

  • Ensemble Methods: Construct a set of classifiers from the training data

    Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

  • General Idea

    (Figure: Step 1: create multiple data sets D1, D2, …, Dt−1, Dt from the original training data D; Step 2: build multiple classifiers C1, C2, …, Ct−1, Ct, one per data set; Step 3: combine the classifiers into a single classifier C*.)

  • Why does it work? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent. The ensemble errs only when a majority of the base classifiers err, so the probability that the ensemble classifier makes a wrong prediction is Σ(i=13..25) C(25, i) εⁱ (1 − ε)²⁵⁻ⁱ ≈ 0.06
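
    Checking that number directly in Python:

    from math import comb

    eps = 0.35
    # the ensemble errs when 13 or more of the 25 base classifiers err
    p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
    print(round(p_wrong, 2))  # 0.06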

  • Examples of Ensemble Methods: How to generate an ensemble of classifiers? Bagging

    Boosting

  • Bagging: Sampling with replacement

    Build a classifier on each bootstrap sample

    Each record has probability 1 − (1 − 1/n)ⁿ ≈ 0.632 (for large n) of being selected at least once in a bootstrap sample of size n
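
    A bagging sketch (illustrative: any base learner exposing a train function that returns a predict function would do):

    import random
    from collections import Counter

    def bagging(records, train_fn, n_models=25):
        """records: list of (x, y); train_fn(sample) -> predict function."""
        n = len(records)
        models = [train_fn([random.choice(records) for _ in range(n)])  # bootstrap sample
                  for _ in range(n_models)]
        def predict(x):
            return Counter(m(x) for m in models).most_common(1)[0][0]  # majority vote
        return predict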

  • Boosting: An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records. Initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round

  • Boosting: Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased

    Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds

  • Example: AdaBoost: Base classifiers: C1, C2, …, CT

    Error rate: εi = (1/N) Σj wj δ( Ci(xj) ≠ yj )

    Importance of a classifier: αi = ½ ln( (1 − εi) / εi )

  • Example: AdaBoost: Weight update: wj(i+1) = ( wj(i) / Zi ) × exp(−αi) if Ci(xj) = yj, and ( wj(i) / Zi ) × exp(+αi) if Ci(xj) ≠ yj, where Zi is a normalization factor

    If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated. Classification: C*(x) = argmax over y of Σi αi δ( Ci(x) = y )
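
    Those update rules in a compact sketch (the pool of candidate base classifiers is left abstract; everything else follows the formulas above, except that rounds with error ≥ 50% simply stop instead of resampling):

    from math import log, exp

    def adaboost(points, labels, candidates, T=10):
        """points/labels: training data with labels ±1; candidates: functions x -> ±1."""
        N = len(points)
        w = [1.0 / N] * N
        ensemble = []
        for _ in range(T):
            err = lambda h: sum(wi for wi, x, y in zip(w, points, labels) if h(x) != y)
            h = min(candidates, key=err)   # best base classifier under current weights
            eps = err(h)
            if eps == 0 or eps >= 0.5:
                break
            alpha = 0.5 * log((1 - eps) / eps)
            ensemble.append((alpha, h))
            # increase weights of wrongly classified records, decrease the rest
            w = [wi * exp(alpha if h(x) != y else -alpha)
                 for wi, x, y in zip(w, points, labels)]
            Z = sum(w)
            w = [wi / Z for wi in w]       # normalize so the weights sum to 1
        return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1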

  • Illustrating AdaBoost

    (Figure: three boosting rounds B1, B2, B3 on a one-dimensional set of + and − points, with α = 1.9459, 2.9323, and 3.8744; the weights of misclassified points grow, e.g., from 0.0094 to 0.4623 after round 1, so later rounds focus on the hard points, and the combined ensemble classifies all points correctly.)