ID3 Algorithm & ROC Analysis

ID3 Algorithm & ROC Analysis
Talha KABAKU

    [email protected]

    Agenda

Where are we now?
Decision Trees
What is ID3?
Entropy
Information Gain
Pros and Cons of ID3
An Example - The Simpsons
What is ROC Analysis?
ROC Space
ROC Space Example over Predictions

    Where are we now?

    Decision Trees

One of the most widely used classification approaches, because of its clear model and presentation

Classification is done using data attributes: the aim is to estimate the value of the destination (target) field from the source fields

Tree induction:
  Create the tree
  Apply data to the tree to classify it (a minimal sketch follows below)

Each branch node represents a choice between a number of alternatives
Each leaf node represents a classification or decision
Leaf Count = Rule Count
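To make the branch/leaf distinction concrete, here is a minimal sketch (mine, not from the slides) of a decision tree stored as nested Python dictionaries, with a routine that routes a record from the root to a leaf; the attribute names are chosen only for illustration and echo the car example later in the deck.

# A tree as nested dicts: an internal node maps an attribute name to
# {attribute value: subtree}; a leaf is simply a class label (a string).
tree = {"Weight": {"heavy": "not fast",
                   "average": "fast",
                   "light": {"SC/Turbo": {"yes": "fast", "no": "not fast"}}}}

def classify(node, record):
    """Walk the record down the tree until a leaf (class label) is reached."""
    if not isinstance(node, dict):        # leaf node: return the decision
        return node
    attribute = next(iter(node))          # attribute tested at this branch node
    return classify(node[attribute][record[attribute]], record)

print(classify(tree, {"Weight": "light", "SC/Turbo": "yes"}))   # -> fast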

    Decision Trees (Cont.)

Leaves are inserted from the top of the tree to the bottom.

(Figure: an example tree with root A, branch nodes B and C, and leaves D, E, F, G.)

    Sample Decision Tree

Creating a Tree Model from Training Data

    Decision Tree Classification Task

    Apply Model to Test Data

(Figure-only slides: the model is applied to a test record step by step, routing it down the tree one attribute test at a time until it reaches a leaf and receives that leaf's class label.)

    Decision Tree Algorithms

Classification and Regression Algorithms
  Twoing
  Gini
Entropy-based Algorithms
  ID3
  C4.5
Memory-based (Sample-based) Classification Algorithms

    Decision Trees by Variable Type

Single-Variable Decision Trees: classification is done by asking questions about only one variable

Hybrid Decision Trees: classification is done by asking questions about both single and multiple variables

Multiple-Variable Decision Trees: classification is done by asking questions about multiple variables

    ID3 Algorithm

Iterative Dichotomizer 3
Developed by J. Ross Quinlan in 1979
Based on entropy
Only works for discrete data
Cannot work with defective data
Its advantage over Hunt's algorithm is choosing the right attribute during classification (Hunt's algorithm chooses randomly)

    Entropy

A formula that measures the homogeneity of a sample; it gives an idea of how much information gain each split provides

A completely homogeneous sample has an entropy of 0
An equally divided (two-class) sample has an entropy of 1

Formula: E(S) = -Σ p_i log2(p_i), where p_i is the proportion of class i in the sample
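As a minimal sketch (my illustration, not the presenter's), the formula translates directly into a few lines of Python over per-class counts:

from math import log2

def entropy(counts):
    """Entropy of a sample given its per-class counts, e.g. entropy([4, 5])."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))   # 1.0: equally divided sample
print(entropy([4, 5]))   # ~0.9911: the 4F/5M sample used in the Simpsons example
# A completely homogeneous sample, e.g. entropy([9, 0]), evaluates to 0.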

    Information Gain (IG)

Information Gain measures the effective change in entropy after making a decision based on the value of an attribute.

Which attribute creates the most homogeneous branches?

First the entropy of the total dataset is calculated.
The dataset is then split on the different attributes.

    Information Gain (Cont.)

The entropy for each branch is calculated, then added proportionally to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the Information Gain, or decrease in entropy.
The attribute that yields the largest IG is chosen for the decision node.
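The procedure just described can be sketched in Python as follows; the data layout (a list of dicts with a "Class" key) is an assumption made only for illustration.

from collections import Counter
from math import log2

def entropy(rows, target="Class"):
    total = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target="Class"):
    """Entropy before the split minus the size-weighted entropy of each branch."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row)
    weighted = sum(len(b) / len(rows) * entropy(b, target) for b in branches.values())
    return entropy(rows, target) - weighted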

    Information Gain (Cont.)

A branch set with an entropy of 0 is a leaf node.
Otherwise, the branch needs further splitting to classify its dataset.
The ID3 algorithm is run recursively on the non-leaf branches until all data is classified.

    ID3 Algorithm Steps

function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
    If S is empty, return a single node with value Failure;
    If S consists of records all with the same value for the categorical
        attribute, return a single node with that value;
    If R is empty, then return a single node with as value the most frequent
        of the values of the categorical attribute found in records of S;
        [note that then there will be errors, that is, records that will be
        improperly classified];
    Let D be the attribute with largest Gain(D, S) among attributes in R;
    Let {dj | j = 1, 2, .., m} be the values of attribute D;
    Let {Sj | j = 1, 2, .., m} be the subsets of S consisting respectively
        of records with value dj for attribute D;
    Return a tree with root labeled D and arcs labeled d1, d2, .., dm
        going respectively to the trees
        ID3(R - {D}, C, S1), ID3(R - {D}, C, S2), .., ID3(R - {D}, C, Sm);
end ID3;
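For readers who prefer running code, here is a compact Python rendering of the pseudocode above: a sketch under the same assumptions (purely categorical attributes; a majority vote stands in for "the most frequent value" when R is empty). The nested-dict tree format matches the earlier classify sketch and is my choice, not part of the original slides.

from collections import Counter
from math import log2

def entropy(rows, target):
    total = len(rows)
    return -sum((c / total) * log2(c / total)
                for c in Counter(r[target] for r in rows).values())

def gain(rows, attribute, target):
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row)
    split_entropy = sum(len(b) / len(rows) * entropy(b, target)
                        for b in branches.values())
    return entropy(rows, target) - split_entropy

def id3(rows, attributes, target="Class"):
    """Return a leaf (class label) or a tree of the form {attribute: {value: subtree}}."""
    if not rows:
        return "Failure"                                  # S is empty
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:
        return classes[0]                                 # all records agree
    if not attributes:
        return Counter(classes).most_common(1)[0][0]      # majority class
    best = max(attributes, key=lambda a: gain(rows, a, target))
    subtrees = {}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        subtrees[value] = id3(subset, [a for a in attributes if a != best], target)
    return {best: subtrees}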

    Pros of ID3 Algorithm

Builds the decision tree in a minimal number of steps
The most important point in tree induction is collecting enough reliable data associated with the chosen properties
Asking the right questions determines the induced tree
Each level benefits from the choices made at previous levels
The whole dataset is scanned to create the tree

    Cons of ID3 Algorithm

The tree cannot be updated when new data is classified incorrectly; a new tree must be generated instead
Only one attribute at a time is tested for making a decision
Cannot work with defective data
Cannot work with numerical attributes

    An Example - The Simpsons

Person   Hair Length   Weight   Age   Class
Homer    0''           250      36    M
Marge    10''          150      34    F
Bart     2''           90       10    M
Lisa     6''           78       8     F
Maggie   4''           20       1     F
Abe      1''           170      70    M
Selma    8''           160      41    F
Otto     10''          180      38    M
Krusty   6''           200      45    M

    Information Gain over Hair Length

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> General Information Gain

E(1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
E(3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

    Information Gain over Weight

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> General Information Gain

E(4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
E(0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

    Information Gain over Age

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911  ==> General Information Gain

E(3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
E(1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
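The three gains can be re-checked with a short script (mine, not part of the slides); the split points <= 5'', <= 160 lb and <= 40 years are the ones used in the calculations above.

from math import log2

def entropy(a, b):
    total = a + b
    return -sum((c / total) * log2(c / total) for c in (a, b) if c > 0)

# (hair length, weight, age, class) for the nine characters in the table
data = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
        (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
        (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]

def gain(split):
    """Information gain of a binary split over the whole sample."""
    whole = entropy(sum(c == "F" for *_, c in data), sum(c == "M" for *_, c in data))
    weighted = 0.0
    for side in (True, False):
        classes = [c for *features, c in data if split(features) == side]
        weighted += len(classes) / len(data) * entropy(classes.count("F"),
                                                       classes.count("M"))
    return whole - weighted

print(gain(lambda f: f[0] <= 5))     # Hair Length <= 5''  -> ~0.0911
print(gain(lambda f: f[1] <= 160))   # Weight <= 160       -> ~0.5900
print(gain(lambda f: f[2] <= 40))    # Age <= 40           -> ~0.0183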

    Results

As seen in the results, Weight is the best attribute to classify this group.

Attribute     Information Gain (IG)
Hair Length   0.0911
Weight        0.5900
Age           0.0183

    Constructed Decision Tree

(Figure: the constructed decision tree, with Weight as the root node.)

    Entropy over Nominal Values

If an attribute has nominal values:
First calculate the entropy for each attribute value
Then combine them, weighted by subset size, to calculate the attribute's information gain

    Example II

(Dataset: 15 cars described by Engine, SC/Turbo, Weight and Fuel Eco, classified as fast or not fast: 5 fast, 10 not fast.)

IE(S) = -(5/15)log2(5/15) - (10/15)log2(10/15) = ~0.918

    Example II (Cont.)

    Information Gain over Engine

Engine: 6 small, 5 medium, 4 large
3 values for attribute Engine, so we need 3 entropy calculations:

small: 5 no, 1 yes; IGsmall = -(5/6)log2(5/6) - (1/6)log2(1/6) = ~0.65
medium: 3 no, 2 yes; IGmedium = -(3/5)log2(3/5) - (2/5)log2(2/5) = ~0.97
large: 2 no, 2 yes; IGlarge = 1 (evenly distributed subset)

IGEngine = IE(S) - [(6/15)*IGsmall + (5/15)*IGmedium + (4/15)*IGlarge]
IGEngine = 0.918 - 0.85 = 0.068

    Example II (Cont.)

    Information Gain over SC/Turbo

SC/Turbo: 4 yes, 11 no
2 values for attribute SC/Turbo, so we need 2 entropy calculations:

yes: 2 yes, 2 no; IGyes = 1 (evenly distributed subset)
no: 3 yes, 8 no; IGno = -(3/11)log2(3/11) - (8/11)log2(8/11) = ~0.84

IGturbo = IE(S) - [(4/15)*IGyes + (11/15)*IGno]
IGturbo = 0.918 - 0.886 = 0.032

    Example II (Cont.)

    Information Gain over Weight

Weight: 6 average, 4 light, 5 heavy
3 values for attribute Weight, so we need 3 entropy calculations:

average: 3 no, 3 yes; IGaverage = 1 (evenly distributed subset)
light: 3 no, 1 yes; IGlight = -(3/4)log2(3/4) - (1/4)log2(1/4) = ~0.81
heavy: 4 no, 1 yes; IGheavy = -(4/5)log2(4/5) - (1/5)log2(1/5) = ~0.72

IGWeight = IE(S) - [(6/15)*IGaverage + (4/15)*IGlight + (5/15)*IGheavy]
IGWeight = 0.918 - 0.856 = 0.062

    Example II (Cont.)

Information Gain over Fuel Eco

Fuel Economy: 2 good, 3 average, 10 bad
3 values for attribute Fuel Eco, so we need 3 entropy calculations:

good: 0 yes, 2 no; IGgood = 0 (no variability)
average: 0 yes, 3 no; IGaverage = 0 (no variability)
bad: 5 yes, 5 no; IGbad = 1 (evenly distributed subset)

We can omit the calculations for good and average since those cars always end up not fast.

IGFuelEco = IE(S) - [(10/15)*IGbad]
IGFuelEco = 0.918 - 0.667 = 0.251

    Example II (Cont.)

    Results: Root of the tree

    IGEngine 0.068

    IGturbo 0.032

    IGWeight 0.062

    IGFuelEco 0.251
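Fuel Eco has the highest gain here and becomes the root. The four values can be reproduced directly from the per-value class counts quoted on the preceding slides; the helper below is a sketch of mine, with each attribute value written as a (fast, not fast) count pair.

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def attribute_gain(value_counts, whole_counts):
    """Whole-set entropy minus the size-weighted entropy of each value's subset."""
    total = sum(whole_counts)
    weighted = sum(sum(vc) / total * entropy(vc) for vc in value_counts)
    return entropy(whole_counts) - weighted

whole = (5, 10)                                              # 5 fast, 10 not fast
print(attribute_gain([(1, 5), (2, 3), (2, 2)], whole))       # Engine   -> ~0.068
print(attribute_gain([(2, 2), (3, 8)], whole))               # SC/Turbo -> ~0.032
print(attribute_gain([(3, 3), (1, 3), (1, 4)], whole))       # Weight   -> ~0.062
print(attribute_gain([(0, 2), (0, 3), (5, 5)], whole))       # Fuel Eco -> ~0.251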

    Example II (Cont.)

Since we selected the Fuel Eco attribute for our root node, it is removed from the table for future calculations.

General Information Gain = 1 (evenly distributed set: within the Fuel Eco = bad branch, 5 cars are fast and 5 are not)

    Example II (Cont.)

Information Gain over Engine

Engine: 1 small, 5 medium, 4 large
3 values for attribute Engine, so we need 3 entropy calculations:

small: 1 yes, 0 no; IGsmall = 0 (no variability)
medium: 2 yes, 3 no; IGmedium = -(2/5)log2(2/5) - (3/5)log2(3/5) = ~0.97
large: 2 yes, 2 no; IGlarge = 1 (evenly distributed subset)

IGEngine = IE(SFuelEco) - [(5/10)*IGmedium + (4/10)*IGlarge]
IGEngine = 1 - 0.885 = 0.115

    Example II (Cont.)

Information Gain over SC/Turbo

SC/Turbo: 3 yes, 7 no
2 values for attribute SC/Turbo, so we need 2 entropy calculations:

yes: 2 yes, 1 no; IGyes = -(2/3)log2(2/3) - (1/3)log2(1/3) = ~0.92
no: 3 yes, 4 no; IGno = -(3/7)log2(3/7) - (4/7)log2(4/7) = ~0.99

IGturbo = IE(SFuelEco) - [(3/10)*IGyes + (7/10)*IGno]
IGturbo = 1 - 0.965 = 0.035

    Example II (Cont.)

Information Gain over Weight

Weight: 3 average, 5 heavy, 2 light
3 values for attribute Weight, so we need 3 entropy calculations:

average: 3 yes, 0 no; IGaverage = 0 (no variability)
heavy: 1 yes, 4 no; IGheavy = -(1/5)log2(1/5) - (4/5)log2(4/5) = ~0.72
light: 1 yes, 1 no; IGlight = 1 (evenly distributed subset)

IGWeight = IE(SFuelEco) - [(5/10)*IGheavy + (2/10)*IGlight]
IGWeight = 1 - 0.561 = 0.439

    Example II (Cont.)

    Results:

Weight has the highest gain, and is thus the best choice.

    IGEngine 0.115

    IGturbo 0.035

    IGWeight 0.439

    Example II (Cont.)

Since there are only two items for SC/Turbo where Weight = Light, and the result is consistent, we can simplify the Weight = Light path.

    Example II (Cont.)

Updated table (Weight = Heavy):

All cars with large engines in this table are not fast. Due to inconsistent patterns in the data, there is no way to proceed further, since medium-sized engines may lead to either fast or not fast.

    ROC Analysis

Receiver Operating Characteristic

The limitations of diagnostic accuracy as a measure of decision performance require the introduction of the concepts of the sensitivity and specificity of a diagnostic test. These measures, and the related indices true positive rate and false positive rate, are more meaningful than accuracy.

The ROC curve is a complete description of this decision threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions.

    ROC Analysis (Cont.)

Combinations of correct and incorrect decisions:

Actual value   Prediction outcome   Description
p              p                    True Positive (TP)
p              n                    False Negative (FN)
n              p                    False Positive (FP)
n              n                    True Negative (TN)

TPR is equivalent to sensitivity.
FPR is equivalent to 1 - specificity.
The best possible prediction would be 100% sensitivity and 100% specificity (which means FPR = 0%).

    ROC Space

A ROC space is defined by FPR and TPR as the x and y axes respectively, and depicts the relative trade-offs between true positives (benefits) and false positives (costs).
Since TPR is equivalent to sensitivity and FPR is equal to 1 - specificity, the ROC graph is sometimes called the sensitivity vs. (1 - specificity) plot.
Each prediction result corresponds to one point in the ROC space.

    Calculations

Sensitivity: TPR = TP / P = TP / (TP + FN)
False positive rate: FPR = FP / N = FP / (FP + TN)   (specificity = 1 - FPR)
Accuracy: ACC = (TP + TN) / (P + N)
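These formulas translate directly into code; a minimal sketch (mine):

def roc_metrics(tp, fp, fn, tn):
    """Sensitivity (TPR), false positive rate (FPR) and accuracy from the
    four confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives and actual negatives
    tpr = tp / p                     # sensitivity     = TP / (TP + FN)
    fpr = fp / n                     # 1 - specificity = FP / (FP + TN)
    acc = (tp + tn) / (p + n)
    return tpr, fpr, acc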

    A ROC Space Example

Let A, B, C and D be predictions over 100 negative and 100 positive instances:

Prediction   TP   FP   FN   TN   TPR    FPR    ACC
A            63   28   37   72   0.63   0.28   0.68
B            77   77   23   23   0.77   0.77   0.50
C            24   88   76   12   0.24   0.88   0.18
D            76   12   24   88   0.76   0.12   0.82
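The TPR, FPR and ACC columns can be reproduced from the four counts using the formulas on the previous slide; this short script is my illustration.

# (TP, FP, FN, TN) for predictions A-D over 100 positive and 100 negative instances
predictions = {"A": (63, 28, 37, 72), "B": (77, 77, 23, 23),
               "C": (24, 88, 76, 12), "D": (76, 12, 24, 88)}

for name, (tp, fp, fn, tn) in predictions.items():
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / (tp + fp + fn + tn)
    print(name, round(tpr, 2), round(fpr, 2), round(acc, 2))
# D, for example, gives 0.76, 0.12, 0.82, matching the last row of the table.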

    A ROC Space Example (Cont.)

    References

1. Data Mining Course Lectures, Ass. Prof. Nilfer Yurtay
2. Quinlan, J.R., 1986, Machine Learning, 1, 81.
3. http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
4. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2011.
5. http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
6. C. E. Metz, Basic Principles of ROC Analysis, Seminars in Nuclear Medicine, Volume 8, Issue 4, pp. 283-298.
