Introduction to Data Mining Modelling Algorithms

Source: dml.cs.byu.edu/~cgc/docs/atdm/Readings/DMAlgOverview.pdf (2008-02-25)


  • Introduction to Data Mining Modelling Algorithms

  • Aim and Scope
    ● Brief introduction to the main concepts and modelling algorithms for DM applications
    ● Enough background to use existing commercial/freeware tools on simple tasks
    ● Not exhaustive: only representatives of each class of models are presented

  • Data Mining Approaches
    ● Predictive Modeling
      – Decision trees: ID3
      – Instance-based: k-NN
      – Probabilistic: NB, Linear/Logistic Regression
      – Connectionist: Backpropagation
    ● Descriptive Modeling
      – Summarization
      – Clustering: k-Means, COBWEB, Competitive Learning
      – Association Rules: Apriori
    ● Genetic Algorithms

  • Decision Tree
    ● Structure:
      – Internal nodes: tests on some property
      – Branches: values of the associated property
      – Leaf nodes: classifications
    ● Classification: traverse the tree from the root to a leaf
    ● Goal: construct a decision tree that allows the classification of objects from examples
      – Occam's Razor: prefer the smallest tree consistent with the examples

  • ID3 Algorithm
    ● Function ID3(Example-set, Properties)
      – If all elements in Example-set are in the same class, then return a leaf node labeled with that class
      – Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
      – Else
        ● Select P from Properties so as to maximize information gain
        ● Remove P from Properties
        ● Make P the root of the current tree
        ● For each value V of P
          – Create a branch of the current tree labeled by V
          – Partition_V ← elements of Example-set with value V for P
          – ID3(Partition_V, Properties)
          – Attach the result to branch V
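As a concrete illustration, here is a minimal Python sketch of the recursion above. It assumes a helper information_gain (a sketch of it follows the Information Gain slide) and examples represented as dicts mapping property names to values plus a "class" key; these representation choices are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def majority_class(examples):
    """Most common class label among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def id3(examples, properties):
    """Recursive ID3 sketch: returns a class label (leaf) or a tuple
    (property, {value: subtree}) for an internal node."""
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples in the same class
        return classes.pop()
    if not properties:                         # no properties left: majority class
        return majority_class(examples)
    # select the property that maximizes information gain (helper sketched below)
    best = max(properties, key=lambda p: information_gain(examples, p))
    remaining = [p for p in properties if p != best]
    branches = {}
    for value in {e[best] for e in examples}:  # one branch per observed value
        partition = [e for e in examples if e[best] == value]
        branches[value] = id3(partition, remaining)
    return (best, branches)
```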

  • Information Gain
    ● Entropy:
      – Let S be a set of examples from c classes:
        $\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
        where $p_i$ is the proportion of examples of S belonging to class i (note: we define $0 \log 0 = 0$)
      – The smaller the entropy, the "purer" the partition
    ● Information Gain:
      – Let p be a property with n outcomes; partitioning S based on p results in a gain of:
        $\mathrm{Gain}(S, p) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$
        where $S_i$ is the subset of S for which property p has its i-th value
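A minimal sketch of these two formulas in Python, using the same dict-of-attributes representation assumed in the ID3 sketch above:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy(S) = sum_i -p_i * log2(p_i) over the class proportions."""
    n = len(examples)
    counts = Counter(e["class"] for e in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, prop):
    """Gain(S, p) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)."""
    n = len(examples)
    remainder = 0.0
    for value in {e[prop] for e in examples}:
        subset = [e for e in examples if e[prop] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder
```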

  • Illustrative Training Set: Risk Assessment for Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  RISK LEVEL
    1         Bad             High        None        Low           HIGH
    2         Unknown         High        None        Medium        HIGH
    3         Unknown         Low         None        Medium        MODERATE
    4         Unknown         Low         None        Low           HIGH
    5         Unknown         Low         None        High          LOW
    6         Unknown         Low         Adequate    High          LOW
    7         Bad             Low         None        Low           HIGH
    8         Bad             Low         Adequate    High          MODERATE
    9         Good            Low         None        High          LOW
    10        Good            High        Adequate    High          LOW
    11        Good            High        None        Low           HIGH
    12        Good            High        None        Medium        MODERATE
    13        Good            High        None        High          LOW
    14        Bad             High        None        Medium        HIGH

  • ID3 Example (I)

    1) Choose Income as the root of the tree:
       – Income = Low → clients 1, 4, 7, 11 (step 2)
       – Income = Medium → clients 2, 3, 12, 14 (step 3)
       – Income = High → clients 5, 6, 8, 9, 10, 13 (step 4)

    2) Income = Low: all examples are in the same class, HIGH. Return a leaf node.

    3) Income = Medium: choose Debt Level as the root of the subtree:
       – Debt Level = Low → client 3 (step 3a)
       – Debt Level = High → clients 2, 12, 14 (step 3b)

    3a) Debt Level = Low: all examples are in the same class, MODERATE. Return a leaf node.

  • ID3 Example (II)

    3b) Choose Credit History as the root of the subtree for Debt Level = High (clients 2, 12, 14).

    3b'-3b''') For each value of Credit History, all examples are in the same class. Return leaf nodes:
        – Good (client 12) → MODERATE
        – Bad (client 14) → HIGH
        – Unknown (client 2) → HIGH

    4) Choose Credit History as the root of the subtree for Income = High (clients 5, 6, 8, 9, 10, 13).

    4a-4c) For each value of Credit History, all examples are in the same class. Return leaf nodes:
        – Good (clients 9, 10, 13) → LOW
        – Bad (client 8) → MODERATE
        – Unknown (clients 5, 6) → LOW

  • ID3 Example (III): attach the subtrees at the appropriate places. The final tree:

    Income
      = Low    → HIGH
      = Medium → Debt Level
                   = Low  → MODERATE
                   = High → Credit History
                              = Good    → MODERATE
                              = Bad     → HIGH
                              = Unknown → HIGH
      = High   → Credit History
                   = Good    → LOW
                   = Bad     → MODERATE
                   = Unknown → LOW

  • Discussion

    ● ID3 handles only discrete attributes: extensions to numerical attributes have been proposed, the most famous being C4.5/C5.0

    ● Experience shows that decision trees tend to produce very good results on many problems

    ● Trees are most attractive when end users want interpretable knowledge from their data

    ● Overfitting likely: use pruning

  • Instance-based
    ● Does not build an explicit model
    ● Training data is stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify the new query instance
    ● Lazy learning: simply compute the classification of each new query instance as needed

  • k-NN Algorithm
    ● For each training instance t = (x, f(x))
      – Add t to the set Tr_instances
    ● Given a query instance q to be classified
      – Let x_1, …, x_k be the k instances in Tr_instances nearest to q
      – Return
        $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
        where V is the finite set of target class values, and δ(a, b) = 1 if a = b, and 0 otherwise
    ● Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors
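A minimal Python sketch of this procedure for numeric feature vectors, using Euclidean distance; the function names and data layout are assumptions for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    """training: list of (vector, label) pairs; returns the majority
    label among the k training instances nearest to the query."""
    neighbours = sorted(training, key=lambda t: euclidean(t[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# usage: changing k (e.g., 1 vs 5) can flip the prediction, as in the illustration that follows
print(knn_classify([((0, 0), "+"), ((1, 1), "-"), ((2, 2), "-")], (0.4, 0.4), k=1))
```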

  • Illustration
    (figure: a query point q surrounded by + and – training examples)
    q is + under 1-NN, but – under 5-NN

  • Discussion
    ● Works well on many practical problems and is fairly noise tolerant (depending on k)
    ● Some issues:
      – Irrelevant attributes dominate the distance function: use a weighted distance function or remove irrelevant attributes
      – Long query processing: implement indexing schemes
      – Large memory requirement: implement instance reduction strategies
      – Non-Euclidean/heterogeneous spaces: extend/design adequate distance measures

  • Probabilistic Learning

    ● We are often interested in determining the best hypothesis from some space H, given the observed training data D.

    ● One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.

  • Remark
    ● Instead of finding the most probable hypothesis given the training data, it is often the following related question that is most relevant:
      – Which is the most probable classification of the new query instance given the training data?
    ● In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

  • Bayes Optimal Classification
    ● If the possible classification of the new instance can take on any value v_j from some set V, then the probability P(v_j|D) that the correct classification for the new instance is v_j is given by:
      $P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
    ● The optimal classification of the new instance is then:
      $\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
    ● Nice, but impractical for large hypothesis spaces

  • Naïve Bayes Learning
    ● A practical Bayesian learning method
    ● Applies when instances are conjunctions of attribute values and the target takes its values from some finite set V
    ● Consists in assigning to a new query instance the most probable target value, v_MAP, given the attribute values a_1, …, a_n that describe the instance, i.e.,
      $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$

  • NB Algorithm
    ● Using Bayes' theorem, this can be reformulated as:
      $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)}$
    ● Making the further simplifying assumption that attribute values are conditionally independent given the target value, one can write the conjunctive conditional probability as a product of simple conditional probabilities, producing the algorithm:
      Return $\arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$

  • Illustrative Training Set: Risk Assessment for Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  RISK LEVEL
    1         Bad             High        None        Low           HIGH
    2         Unknown         High        None        Medium        HIGH
    3         Unknown         Low         None        Medium        MODERATE
    4         Unknown         Low         None        Low           HIGH
    5         Unknown         Low         None        High          LOW
    6         Unknown         Low         Adequate    High          LOW
    7         Bad             Low         None        Low           HIGH
    8         Bad             Low         Adequate    High          MODERATE
    9         Good            Low         None        High          LOW
    10        Good            High        Adequate    High          LOW
    11        Good            High        None        Low           HIGH
    12        Good            High        None        Medium        MODERATE
    13        Good            High        None        High          LOW
    14        Bad             High        None        Medium        HIGH

  • NB Example

    Class counts and priors (over the 14 clients):
      Risk = High: 6 (0.43)   Risk = Moderate: 3 (0.21)   Risk = Low: 5 (0.36)

    Conditional probabilities P(attribute value | Risk Level):

    Credit History   High   Moderate   Low
    Unknown          0.33   0.33       0.40
    Bad              0.50   0.33       0.00
    Good             0.17   0.33       0.60

    Debt Level       High   Moderate   Low
    High             0.67   0.33       0.40
    Low              0.33   0.67       0.60

    Collateral       High   Moderate   Low
    None             1.00   0.67       0.60
    Adequate         0.00   0.33       0.40

    Income Level     High   Moderate   Low
    High             0.00   0.33       1.00
    Medium           0.33   0.67       0.00
    Low              0.67   0.00       0.00

    Query instance (Bad, Low, Adequate, Medium):
      High 0.00%   Moderate 1.06%   Low 0.00%   →   Prediction: Moderate

    Query instance (Bad, High, None, Low), already seen (client 1):
      High 9.52%   Moderate 0.00%   Low 0.00%   →   Prediction: High
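A minimal Python sketch of the NB algorithm applied to this training set; the data encoding and function names are assumptions, and the computed scores should match the percentages above up to rounding:

```python
from collections import Counter

# (Credit History, Debt Level, Collateral, Income Level) -> Risk Level
data = [
    (("Bad", "High", "None", "Low"), "High"),
    (("Unknown", "High", "None", "Medium"), "High"),
    (("Unknown", "Low", "None", "Medium"), "Moderate"),
    (("Unknown", "Low", "None", "Low"), "High"),
    (("Unknown", "Low", "None", "High"), "Low"),
    (("Unknown", "Low", "Adequate", "High"), "Low"),
    (("Bad", "Low", "None", "Low"), "High"),
    (("Bad", "Low", "Adequate", "High"), "Moderate"),
    (("Good", "Low", "None", "High"), "Low"),
    (("Good", "High", "Adequate", "High"), "Low"),
    (("Good", "High", "None", "Low"), "High"),
    (("Good", "High", "None", "Medium"), "Moderate"),
    (("Good", "High", "None", "High"), "Low"),
    (("Bad", "High", "None", "Medium"), "High"),
]

def nb_classify(data, query):
    """Return ({class: P(v_j) * prod_i P(a_i | v_j)}, arg max class)."""
    class_counts = Counter(label for _, label in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)                      # prior P(v_j)
        for i, value in enumerate(query):                # product of P(a_i | v_j)
            n_match = sum(1 for x, y in data if y == label and x[i] == value)
            score *= n_match / n_label
        scores[label] = score
    return scores, max(scores, key=scores.get)

scores, pred = nb_classify(data, ("Bad", "Low", "Adequate", "Medium"))
print(pred, {k: round(v, 4) for k, v in scores.items()})
# Moderate {'High': 0.0, 'Moderate': 0.0106, 'Low': 0.0}
```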

  • Discussion
    ● Whenever the assumption of conditional independence is satisfied, NB classification is optimal
    ● NB is inherently incremental
    ● NB estimates probabilities by frequencies
      – Assume P(X=x|Y=y) = 0.05 and the training set is such that n_y = 5. Then it is highly probable that n_{x|y} = 0. The fraction is thus an underestimate of the probability, and it will "dominate" the NB classifier for all new queries with X=x.
      – Use the m-estimate: replace n_{x|y}/n_y by (n_{x|y} + mp)/(n_y + m), where p is our prior estimate of the probability we wish to determine and m is a constant (typically, p = 1/(number of values of X), and m acts as a weight, similar to adding m virtual instances distributed according to p)
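For instance, the frequency ratio inside the NB sketch above could be smoothed with the m-estimate as follows (a small assumed helper, not from the slides):

```python
def m_estimate(n_xy, n_y, num_values, m=1.0):
    """(n_xy + m*p) / (n_y + m) with the uniform prior p = 1/num_values."""
    p = 1.0 / num_values
    return (n_xy + m * p) / (n_y + m)
```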

  • Linear Regression
    ● Fit a linear model to data where the dependent variable is continuous:
      $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + e$
    ● Given a set of points (X_i, Y_i), we wish to find a linear function (or line in 2 dimensions) that "goes through" these points.
    ● In general, the points are not exactly aligned:
      – Find the line that best fits the points

  • Sum-squared Error (SSE)
    ● $SSE = \sum_{y} (y_{observed} - y_{predicted})^2$
    ● $TSS = \sum_{y} (y_{observed} - \bar{y}_{observed})^2$
    ● $R^2 = 1 - \frac{SSE}{TSS}$
    ● The smaller the SSE, the better the fit
    ● Hence, linear regression attempts to minimize SSE, or equivalently to maximize $R^2$

  • Analytical Solution
    ● In 2 dimensions:
      $\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$
      $\beta_0 = \frac{\sum y - \beta_1 \sum x}{n}$
    ● In n > 2 dimensions: matrix arithmetic

  • Illustration (target: y = 2x + 1.5)

    x     y      x^2    xy
    1.20  4.00   1.44   4.80
    2.30  5.60   5.29   12.88
    3.10  7.90   9.61   24.49
    3.40  8.00   11.56  27.20
    4.00  10.10  16.00  40.40
    4.60  10.40  21.16  47.84
    5.50  12.00  30.25  66.00
    Σ     24.10  58.00  95.31  223.61

    x     y (obs)  y (pred)  SSE    TSS
    1.20  4.00     3.94      0.004  18.940
    2.30  5.60     6.07      0.223  4.920
    3.10  7.90     7.62      0.076  0.444
    3.40  8.00     8.21      0.042  0.007
    4.00  10.10    9.37      0.533  1.166
    4.60  10.40    10.53     0.018  5.036
    5.50  12.00    12.28     0.078  15.920
                             0.974  46.432

    β1 = 1.94
    β0 = 1.61
    R2 = 0.98
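A quick sketch of the two-dimensional closed-form solution on this data (variable names are assumptions); it should recover roughly β1 ≈ 1.94, β0 ≈ 1.61, R² ≈ 0.98:

```python
xs = [1.20, 2.30, 3.10, 3.40, 4.00, 4.60, 5.50]
ys = [4.00, 5.60, 7.90, 8.00, 10.10, 10.40, 12.00]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# closed-form estimates for the 2-dimensional case
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

preds = [b0 + b1 * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
tss = sum((y - sum_y / n) ** 2 for y in ys)
r2 = 1 - sse / tss

print(round(b1, 2), round(b0, 2), round(r2, 2))   # ~ 1.94 1.61 0.98
```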

  • Logistic Regression
    ● Fit a curve to data in which the dependent variable is binary, or dichotomous
    (figure: scatter plot of NewOut against SurvRate with the fitted regression curve)
    ● Regression curve: a sigmoid function (bounded by the asymptotes y=0 and y=1)
    ● Observations: for each value of SurvRate, the number of dots is the number of patients with that value of NewOut

  • Solution
    ● Given some event with probability p of being 1, the odds of that event are given by:
      odds = p / (1 − p)
    ● The logit is the natural log of the odds:
      logit(p) = ln(odds) = ln(p / (1 − p))
    ● In logistic regression, we seek a model (2-dimensional here):
      $\mathrm{logit}(p) = \beta_0 + \beta_1 X$
    ● That is, the logit is assumed to be linearly related to the independent variables, and we solve an ordinary (linear) regression!

  • Illustration (I)

    Age Group     Coronary Heart Disease     Total
                  No       Yes
    1 (20-29)     9        1                 10
    2 (30-34)     13       2                 15
    3 (35-39)     9        3                 12
    4 (40-44)     10       5                 15
    5 (45-49)     7        6                 13
    6 (50-54)     3        5                 8
    7 (55-59)     4        13                17
    8 (60-69)     2        8                 10
    Total         57       43                100

  • Illustration (II)

    Age Group   p(CHD)=1   odds     log odds   #occ
    1           0.1000     0.1111   -2.1972    10
    2           0.1333     0.1538   -1.8718    15
    3           0.2500     0.3333   -1.0986    12
    4           0.3333     0.5000   -0.6931    15
    5           0.4615     0.8571   -0.1542    13
    6           0.6250     1.6667   0.5108     8
    7           0.7647     3.2500   1.1787     17
    8           0.8000     4.0000   1.3863     10

  • Illustration (III)

    X (AG)   Y (log odds)   X^2         XY          #occ
    1        -2.1972        1.0000      -2.1972     10
    2        -1.8718        4.0000      -3.7436     15
    3        -1.0986        9.0000      -3.2958     12
    4        -0.6931        16.0000     -2.7726     15
    5        -0.1542        25.0000     -0.7708     13
    6        0.5108         36.0000     3.0650      8
    7        1.1787         49.0000     8.2506      17
    8        1.3863         64.0000     11.0904     10
    Σ        448            -37.6471    2504.0000   106.3981    100

    Note: the sums reflect the number of occurrences
    (Sum(X) = X1·#occ(X1) + … + X8·#occ(X8), etc.)

  • Illustration (IV)
    ● Results from the regression:
      – β0 = -2.856 and β1 = 0.5535

    Age Group   p(CHD)=1   est. p
    1           0.1000     0.0909
    2           0.1333     0.1482
    3           0.2500     0.2323
    4           0.3333     0.3448
    5           0.4615     0.4778
    6           0.6250     0.6142
    7           0.7647     0.7346
    8           0.8000     0.8280

    SSE = 0.0028   TSS = 0.5265   R2 = 0.9946

  • Recovering Probabilities
    ● $\ln \frac{p}{1-p} = \beta_0 + \beta_1 X$
    ● $\frac{p}{1-p} = e^{\beta_0 + \beta_1 X}$
    ● $p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
    ● which gives p as a sigmoid function!
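A minimal sketch of this recipe on the CHD data above (fit an ordinary regression on the occurrence-weighted log odds, then recover the probabilities with the sigmoid); variable names are assumptions, and the result should be close to β0 ≈ -2.856, β1 ≈ 0.5535 and the tabulated estimated probabilities:

```python
import math

# age group, observed log odds of CHD, number of occurrences (from the tables above)
groups = [(1, -2.1972, 10), (2, -1.8718, 15), (3, -1.0986, 12), (4, -0.6931, 15),
          (5, -0.1542, 13), (6, 0.5108, 8), (7, 1.1787, 17), (8, 1.3863, 10)]

# weighted sums: each group counts once per occurrence
n = sum(occ for _, _, occ in groups)
sum_x = sum(x * occ for x, _, occ in groups)
sum_y = sum(y * occ for _, y, occ in groups)
sum_x2 = sum(x * x * occ for x, _, occ in groups)
sum_xy = sum(x * y * occ for x, y, occ in groups)

# ordinary least squares on the logit
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n
print(round(b0, 3), round(b1, 4))          # ~ -2.856 0.5535

# recover estimated probabilities with the sigmoid
for x, _, _ in groups:
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    print(x, round(p, 4))                  # ~ 0.0909, 0.1482, ... as tabulated
```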

  • Discussion
    ● Regression is a powerful data mining technique
      – It provides prediction
      – It offers insight into the relative power of each independent variable
    ● Technique of choice in medicine and the social sciences

  • Multi-layer Feed-forward NN
    (figure: fully connected feed-forward network with input units i, hidden units j, and output units k)

  • Training Error
    ● We define the training error of a hypothesis, or weight vector, over a set of data D, by:
      $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
    ● which we will seek to minimize

  • Non-linear Activation
    ● Introduce non-linearity with the sigmoid function:
      $net = \sum_{i=1}^{n} w_i x_i$
      $out = \frac{1}{1 + e^{-net}}$
      $\frac{d\,out}{d\,net} = out \cdot (1 - out)$
    ● Differentiable (required for gradient descent)
    ● Most unstable in the middle

  • Backpropagation
    ● Initialize all weights randomly
    ● Repeat
      – Present a training instance
      – Compute the error δ_k of the output units
      – For each hidden layer
        ● Compute the error δ_j using the error from the next layer
      – Update all weights: w_ij ← w_ij + Δw_ij (with Δw_ij = η O_i δ_j)
    ● Until (E < CriticalError)

  • Network Equations
    ● $\Delta w_{ij} = \eta\, O_i\, \delta_j$
    ● Output units: $\delta_j = (T_j - O_j)\, f'(net_j)$
    ● Hidden units: $\delta_j = \left(\sum_k \delta_k w_{jk}\right) f'(net_j)$
    ● $f'(net_j) = O_j (1 - O_j)$

  • Illustration (I)
    ● Consider a simple network composed of:
      – 3 inputs: a, b, c
      – 1 hidden node: h
      – 2 outputs: q, r
    ● Assume η = 0.5, all weights are initialized to 0.2, and weight updates are incremental
    ● Consider the training set (inputs a b c → targets q r):
      – 1 0 1 → 0 1
      – 0 1 1 → 1 1
    ● 4 iterations over the training set

  • Illustration (II)

    a b c    W_a-h W_b-h W_c-h   h      W_h-q W_h-r   Out_q Out_r   T_q T_r   δ_q    ΔW_h-q   δ_r    ΔW_h-r   δ_h    ΔW_a-h
    (init)   0.2   0.2   0.2            0.2   0.2
    1 0 1    0.2   0.2   0.2     0.6    0.2   0.2     0.53  0.53    0   1     -0.13  -0.04    0.12   0.04     0      0
    (upd)    0.2   0.2   0.2            0.16  0.24
    0 1 1    0.2   0.2   0.2     0.6    0.16  0.24    0.52  0.54    1   1     0.12   0.04     0.12   0.03     0.01   0
    (upd)    0.2   0.21  0.21           0.2   0.27
    1 0 1    0.2   0.21  0.21    0.6    0.2   0.27    0.53  0.54    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.21  0.21           0.16  0.3
    0 1 1    0.2   0.21  0.21    0.6    0.16  0.3     0.52  0.55    1   1     0.12   0.04     0.11   0.03     0.01   0
    (upd)    0.2   0.21  0.21           0.19  0.34
    1 0 1    0.2   0.21  0.21    0.6    0.19  0.34    0.53  0.55    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.21  0.21           0.15  0.37
    0 1 1    0.2   0.21  0.21    0.6    0.15  0.37    0.52  0.56    1   1     0.12   0.04     0.11   0.03     0.01   0
    (upd)    0.2   0.22  0.22           0.19  0.4
    1 0 1    0.2   0.22  0.22    0.6    0.19  0.4     0.53  0.56    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.22  0.22           0.15  0.44
    0 1 1    0.2   0.22  0.22    0.61   0.15  0.44    0.52  0.57    1   1     0.12   0.04     0.11   0.03     0.02   0
    (upd)    0.2   0.23  0.23           0.19  0.47

    ((init) = initialization; (upd) = update weights)
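The trace above can be reproduced (up to rounding) with a short incremental-backpropagation sketch for this 3-1-2 network; the code layout is an assumption, but it follows the update equations two slides back:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 3 inputs -> 1 hidden node h -> 2 outputs (q, r); all weights start at 0.2
w_in = [0.2, 0.2, 0.2]      # W_a-h, W_b-h, W_c-h
w_out = [0.2, 0.2]          # W_h-q, W_h-r
eta = 0.5
training = [([1, 0, 1], [0, 1]), ([0, 1, 1], [1, 1])]

for epoch in range(4):                       # 4 passes over the training set
    for x, targets in training:
        h = sigmoid(sum(wi * xi for wi, xi in zip(w_in, x)))
        outs = [sigmoid(w * h) for w in w_out]
        # output-unit errors: delta = (T - O) * O * (1 - O)
        d_out = [(t - o) * o * (1 - o) for t, o in zip(targets, outs)]
        # hidden-unit error: delta_h = (sum_k delta_k * w_hk) * h * (1 - h)
        d_h = sum(d * w for d, w in zip(d_out, w_out)) * h * (1 - h)
        # incremental updates: w += eta * O_source * delta_target
        w_out = [w + eta * h * d for w, d in zip(w_out, d_out)]
        w_in = [w + eta * xi * d_h for w, xi in zip(w_in, x)]
        # rounded values should match the "update weights" rows above
        print([round(v, 2) for v in w_in + w_out])
```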

  • Discussion
    ● 3-layer BPNNs are universal function approximators
    ● Require many passes over the data
    ● Potential for massive parallelism
    ● Convergence to the global minimum is not guaranteed
      – Use a momentum term:
        $\Delta w_{ij}(n) = \eta\, O_i\, \delta_j + \alpha\, \Delta w_{ij}(n-1)$
        ● Keeps moving through small local (global!) minima or along flat regions
      – Use the incremental/stochastic version of the algorithm
      – Train multiple networks with different starting weights
        ● Select the best on a hold-out validation set
        ● Combine outputs (e.g., weighted average)

  • Check Your Understanding

    Dataset (one instance per line: age, spectacle prescription, astigmatic, tear production rate, class):
      young,myope,no,reduced,none
      young,myope,no,normal,soft
      young,myope,yes,reduced,none
      young,myope,yes,normal,hard
      young,hypermetrope,no,reduced,none
      young,hypermetrope,no,normal,soft
      young,hypermetrope,yes,reduced,none
      young,hypermetrope,yes,normal,hard
      pre-presbyopic,myope,no,reduced,none
      pre-presbyopic,myope,no,normal,soft
      pre-presbyopic,myope,yes,reduced,none
      pre-presbyopic,myope,yes,normal,hard
      pre-presbyopic,hypermetrope,no,reduced,none
      pre-presbyopic,hypermetrope,no,normal,soft
      pre-presbyopic,hypermetrope,yes,reduced,none
      pre-presbyopic,hypermetrope,yes,normal,none
      presbyopic,myope,no,reduced,none
      presbyopic,myope,no,normal,none
      presbyopic,myope,yes,reduced,none
      presbyopic,myope,yes,normal,hard
      presbyopic,hypermetrope,no,reduced,none
      presbyopic,hypermetrope,no,normal,soft
      presbyopic,hypermetrope,yes,reduced,none
      presbyopic,hypermetrope,yes,normal,none

    Attribute Information:
      1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
      2. spectacle prescription: (1) myope, (2) hypermetrope
      3. astigmatic: (1) no, (2) yes
      4. tear production rate: (1) reduced, (2) normal
    Class Distribution:
      1. hard contact lenses: 4
      2. soft contact lenses: 5
      3. no contact lenses: 15

    Apply all of the above predictive modelling algorithms to this dataset

  • Summarization
    ● A compact description for a subset of the data
    ● Retrospective analysis
    ● Techniques:
      – Statistics
      – Information theory
      – Online analytical processing (OLAP)

  • Illustration
    ● Average downtime of all plant equipment in a given month
    ● Total income generated by each sales representative per region per year
    ● Proportion of each type of surgical procedure performed, by gender and ethnicity

  • Estimating Means
    ● What is the maximum likelihood hypothesis for the mean of a single Normal distribution, given observed instances from it?
      – The hypothesis that minimizes the SSE, which in this case happens to be the sample mean:
        $\mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n} x_i$
    ● What if there are k means to estimate?

  • Using Hidden Variables
    ● We have k hidden variables
    ● Each training instance is extended from x_i to <x_i, z_i1, z_i2, …, z_ik>:
      – x_i is the observed instance
      – z_ij are the hidden variables
      – z_ij = 1 if x_i was generated by the j-th Gaussian, and 0 otherwise

  • k-Means Algorithm
    ● Initialization:
      – Set h = <µ1, …, µk>, where the µi's are arbitrary values
    ● Step 1:
      – Calculate C[z_ij] for each hidden variable z_ij, assuming the current hypothesis h: 1 if x_i is closest to µ_j, 0 otherwise
    ● Step 2:
      – Calculate a new maximum likelihood hypothesis h' = <µ1', …, µk'>, assuming the value taken on by each z_ij is C[z_ij] as calculated in Step 1: the mean of the x_i's for which C[z_ij] = 1
      – Replace h by h'
      – If the stopping condition is not met, go to Step 1

  • Intuitively
    1. Pick a number (k) of cluster centers (at random)
    2. Assign every item to its nearest cluster center (e.g., using Euclidean distance)
    3. Move each cluster center to the mean of its assigned items
    4. Repeat steps 2 and 3 until convergence (i.e., change in cluster assignments less than a threshold)
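A minimal Python sketch of these four steps for 2-D points (function and variable names are assumptions):

```python
import math, random

def kmeans(points, k, max_iter=100):
    """points: list of (x, y) tuples. Returns (centers, assignments)."""
    centers = random.sample(points, k)            # step 1: random initial centers
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center (Euclidean distance)
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:        # step 4: stop when nothing changes
            break
        assignments = new_assignments
        # step 3: move each center to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments
```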

  • Illustration (I)
    (figure: X-Y scatter plot) Pick 3 initial cluster centers k1, k2, k3 (randomly)

  • Illustration (II)
    (figure: X-Y scatter plot) Assign each point to the closest cluster center

  • Illustration (III)
    (figure: X-Y scatter plot) Move each cluster center to the mean of each cluster

  • Illustration (IV)
    (figure: X-Y scatter plot) Reassign points closest to a different new cluster center; three points have been "re-labelled"

  • Illustration (V)
    (figure: X-Y scatter plot) Re-compute the cluster means

  • Illustration (VI)
    (figure: X-Y scatter plot) Move the cluster centers to the cluster means; no more changes: done!

  • Discussion
    ● Simple
    ● Items are automatically assigned to clusters
    ● Requires some distance function
    ● Must pick the number of clusters beforehand
    ● Result can vary significantly depending on the initial choice of seeds
      – To increase the chances of finding the global optimum: restart with different random seeds
    ● Sensitive to outliers

  • COBWEB Overview
    ● Symbolic approach to category formation
    ● Uses a global quality metric to determine the number of clusters, the depth of the hierarchy, and the category membership of new instances
    ● Categories are probabilistic
    ● The algorithm is incremental

  • Predictability
    ● P(F_i = v_ij | C_k) is called predictability
      – The probability that an object has value v_ij for feature F_i, given that the object belongs to category C_k
      – The greater this probability, the more likely two objects in a category share the same features

  • Predictiveness
    ● P(C_k | F_i = v_ij) is called predictiveness
      – The probability with which an object belongs to category C_k, given that it has value v_ij for feature F_i
      – The greater this probability, the less likely objects not in the category will have those feature values

  • Category Utility
    ● P(F_i = v_ij) serves as a weight
      – Ensures that frequently occurring feature values exert a stronger influence on the evaluation
    ● CU maximizes the potential for inferring information while maximizing intra-class similarity and inter-class differences:
      $CU = \sum_{k}\sum_{j}\sum_{i} P(F_i = v_{ij} \mid C_k)\, P(C_k \mid F_i = v_{ij})\, P(F_i = v_{ij})$

  • Tree Representation
    ● Each node stores:
      – Its probability of occurrence, P(Ck) (= number of instances at the node / total number of instances)
      – All possible values of every feature observed in the instances and, for each such value, its predictability
      – Predictiveness, computed using Bayes' rule
    ● Leaf nodes correspond to observed instances
    ● All links are "is-a" links

  • COBWEB Algorithm
    ● Function Cobweb(Node, Inst)
      – If Node is a leaf
        ● Create 2 children, L1 and L2, of Node
        ● Set the probabilities of L1 to those of Node
        ● Initialize the probabilities of L2 to those of Inst
        ● Add Inst to Node, updating Node's probabilities
      – Else
        ● Add Inst to Node, updating Node's probabilities
        ● For each child C of Node, compute the CU of the taxonomy obtained by placing Inst in C
        ● Compute
          – S1 = score of the best categorization C1
          – S2 = score of the next best categorization C2
          – S3 = score of placing Inst in a new category
          – S4 = score of merging C1 and C2
          – S5 = score of splitting C1
        ● Case:
          – S1 is the best score: Cobweb(C1, Inst)
          – S3 is the best score: initialize the new category's probabilities to those of Inst
          – S4 is the best score: let Cm be the result of merging C1 and C2; Cobweb(Cm, Inst)
          – S5 is the best score: split C1; Cobweb(Node, Inst)
          – Default: Cobweb(C2, Inst)

  • Illustration (I)
    ● Assume 2 attributes only
      – Color: Blue, Red
      – Shape: Triangle, Square
    ● Examples:
      – Blue Triangle
      – Blue Square
      – Red Triangle
      – Blue Triangle

  • Illustration (II) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (III) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (IV) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (V)
    ● Merge C1 and C2
      – Same as having only the root node: CU = 10/9
    ● Split C1
      – Same as creating a new category: CU = 2
    ● Hence:
      – S1 = 5/3
      – S2 = 4/3
      – S3 = 2
      – S4 = 10/9
      – S5 = 2

  • Illustration (VI)
    ● Highest score is S3
      – Create a new category
    ● Repeat the process for the next example
      – Blue Triangle
      – Results:
        ● S1 = 2
        ● S2 = 7/4
        ● S3 = 2
        ● S4 = 7/6
        ● S5 = 2
      – Highest score is S1 (as expected)

  • Illustration (VII)
    ● Emerging clustering:
      – C1: Blue Triangle
      – C2: Blue Square
      – C3: Red Triangle
    ● Seems reasonable (?)

    http://www-ai.cs.uni-dortmund.de/kdnet/auto?self=$81d91eaae317b2bebb

  • Discussion
    ● Nice probabilistic model with no parameters set a priori
    ● Only handles nominal features (CLASSIT extends it to numerical features)
    ● Sensitive to the order of presentation of instances
    ● Retains each instance, which may cause problems with noisy data

  • CL Overview
    ● Sub-symbolic approach
    ● Weight vectors serve as prototypes
      – The more similar the input vector is to the weight vector, the higher the net input
      – Weights "move about" until they sit in the "middle" of a cluster
    ● Simple model: binary inputs and a single-layer network
      – Output nodes implement a winner-take-all policy
      – Lateral inhibition is enforced between pairs of outputs
      – Every input is connected to every output with an adjustable, real-valued weight (sum of weights for each unit = 1)

  • CL Algorithm
    ● Let
      – n: number of inputs
      – x_i: value (0 or 1) of input i
      – w_ij: weight from input i to output unit j
      – a: number of inputs set to 1
      – g: learning rate
    ● Initialize weights randomly
    ● For each training instance
      – Compute $net_j = \sum_{i=1}^{n} w_{ij}\, x_i$ for each output unit j
      – Update $\Delta w_{ij} = g\left(\frac{x_i}{a} - w_{ij}\right)$ if $j = \arg\max_k net_k$, and 0 otherwise
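A minimal Python sketch of this update rule; run with the same (non-random) initial weights and g = 0.4 as the illustration that follows, it should reproduce the weight traces shown there:

```python
def competitive_learning(weights, examples, g=0.4, epochs=1):
    """weights: list of weight vectors, one per output unit.
    examples: list of binary input vectors. Updates the winner's weights
    with delta_w_ij = g * (x_i / a - w_ij)."""
    for _ in range(epochs):
        for x in examples:
            a = sum(x)                                   # number of inputs set to 1
            nets = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
            j = nets.index(max(nets))                    # winner-take-all
            weights[j] = [w_i + g * (x_i / a - w_i) for w_i, x_i in zip(weights[j], x)]
            print(j, [round(w_i, 2) for w_i in weights[j]])
    return weights

# initial prototypes C1, C2 and training set from the illustration below
w = [[0.1, 0.3, 0.1, 0.2], [0.2, 0.1, 0.4, 0.1]]
competitive_learning(w, [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [1, 0, 0, 1]], epochs=3)
```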

  • Illustration (I)
    ● Assume:
      – 4 Boolean inputs
      – 2 clusters, with initial weight vectors:
        ● C1: .1 .3 .1 .2
        ● C2: .2 .1 .4 .1
      – g = 0.4
    ● Training examples:
      – 1100
      – 0011
      – 1110
      – 1001
    ● Note: weights do not sum to 1 here

  • Illustration (II)
    ● 1100
      – C1: 1x0.1 + 1x0.3 + 0x0.1 + 0x0.2 = 0.4
      – C2: 1x0.2 + 1x0.1 + 0x0.4 + 0x0.1 = 0.3
      – C1 wins; updated weights:
        ● C1: .26 .38 .06 .12
        ● C2: .2 .1 .4 .1
    ● 0011
      – C1: 0x0.26 + 0x0.38 + 1x0.06 + 1x0.12 = 0.18
      – C2: 0x0.2 + 0x0.1 + 1x0.4 + 1x0.1 = 0.5
      – C2 wins; updated weights:
        ● C1: .26 .38 .06 .12
        ● C2: .12 .06 .44 .26

  • Illustration (III)
    ● 1110
      – C1: 1x.26 + 1x.38 + 1x.06 + 0x.12 = 0.7
      – C2: 1x.12 + 1x.06 + 1x.44 + 0x.26 = 0.62
      – C1 wins; updated weights:
        ● C1: .29 .36 .17 .07
        ● C2: .12 .06 .44 .26
    ● 1001
      – C1: 1x.29 + 0x.36 + 0x.17 + 1x.07 = 0.36
      – C2: 1x.12 + 0x.06 + 0x.44 + 1x.26 = 0.38
      – C2 wins; updated weights:
        ● C1: .29 .36 .17 .07
        ● C2: .27 .04 .26 .36

  • Illustration (IV)
    ● 1100
      – C1: 1x0.29 + 1x0.36 + 0x0.17 + 0x0.07 = 0.65
      – C2: 1x0.27 + 1x0.04 + 0x0.26 + 0x0.36 = 0.31
      – C1 wins; updated weights:
        ● C1: .37 .42 .1 .04
        ● C2: .27 .04 .26 .36
    ● 0011
      – C1: 0x0.37 + 0x0.42 + 1x0.1 + 1x0.04 = 0.14
      – C2: 0x0.27 + 0x0.04 + 1x0.26 + 1x0.36 = 0.62
      – C2 wins; updated weights:
        ● C1: .37 .42 .1 .04
        ● C2: .16 .02 .36 .42

  • Illustration (V)
    ● 1110
      – C1: 1x.37 + 1x.42 + 1x.1 + 0x.04 = 0.89
      – C2: 1x.16 + 1x.02 + 1x.36 + 0x.42 = 0.54
      – C1 wins; updated weights:
        ● C1: .36 .39 .19 .02
        ● C2: .16 .02 .36 .42
    ● 1001
      – C1: 1x.36 + 0x.39 + 0x.19 + 1x.02 = 0.38
      – C2: 1x.16 + 0x.02 + 0x.36 + 1x.42 = 0.58
      – C2 wins; updated weights:
        ● C1: .36 .39 .19 .02
        ● C2: .23 .15 .35 .25

  • Illustration (VI)
    ● 1100
      – C1: 1x0.36 + 1x0.39 + 0x0.19 + 0x0.02 = 0.75
      – C2: 1x0.23 + 1x0.15 + 0x0.35 + 0x0.25 = 0.38
      – C1 wins; updated weights:
        ● C1: .42 .43 .11 .01
        ● C2: .23 .15 .35 .25
    ● 0011
      – C1: 0x0.42 + 0x0.43 + 1x0.11 + 1x0.01 = 0.12
      – C2: 0x0.23 + 0x0.15 + 1x0.35 + 1x0.25 = 0.6
      – C2 wins; updated weights:
        ● C1: .42 .43 .11 .01
        ● C2: .14 .09 .41 .35

  • Illustration (VII)
    ● Emerging clustering:
      – C1: 1100, 1110
      – C2: 0011, 1001
    ● Seems reasonable (?)

    http://www.psychology.mcmaster.ca/4i03/demos/competitive1-demo.html

  • Discussion
    ● Number of classes is set a priori (can be extended by adding a new node if all of the existing ones produce low activation)
    ● Learning rate is a parameter (if too large, priority is given to early instances, and the algorithm may get stuck on early opinions)
    ● No measure of goodness for the resulting classification
    ● Easily extended to arbitrary inputs (use a metric for Net)

  • Terminology
    ● Item: an individual element (e.g., an attribute-value pair or a product)
    ● Itemset: a set of items
    ● Transaction: the set of items occurring together in one record

  • Association Rules
    ● Let U be a set of items and let X, Y ⊆ U, with X ∩ Y = ∅
    ● An association rule is an expression of the form X ⇒ Y, whose meaning is:
      – If the elements of X occur in some context, then so do the elements of Y

  • Quality Measures
    ● The following statistical quantities are relevant to association rule mining:
      – support(X): Pr(X) = count(X) / T
      – support(X ⇒ Y) (or support(X ∪ Y)): Pr(X ∪ Y) = count(X ∪ Y) / T
      – confidence(X ⇒ Y): Pr(Y | X) = count(X ∪ Y) / count(X)
      – lift(X ⇒ Y): confidence(X ⇒ Y) / support(Y)
    ● count(X): number of transactions where all elements of X appear
    ● T: total number of transactions considered

  • Learning Associations
    ● The purpose of association rule learning is to find "interesting" rules, i.e., rules that meet the following two user-defined conditions:
      – support(X ⇒ Y) ≥ MinSupport
      – confidence(X ⇒ Y) ≥ MinConfidence

  • Itemsets

    ● Frequent itemset– An itemset whose support is greater than

    some user-specified minimum support (denoted Lk where k is the size of the itemset)

    ● Candidate itemset– A potentially frequent itemset (denoted Ck

    where k is the size of the itemset)

  • Basic Idea

    ● Generate all frequent itemsets satisfying the condition on minimum support

    ● Build all possible rules from these itemsets and check them against the condition on minimum confidence

    ● All the rules above the minimum confidence threshold are returned for further evaluation

  • AprioriAll Algorithm
    ● L1 ← ∅
    ● For each item Ij ∈ I
      – count({Ij}) = |{Ti : Ij ∈ Ti}|
      – If count({Ij}) ≥ MinSupport × m
        ● L1 ← L1 ∪ {({Ij}, count({Ij}))}
    ● k ← 2
    ● While Lk-1 ≠ ∅
      – Lk ← ∅
      – For each (l1, count(l1)) ∈ Lk-1
        ● For each (l2, count(l2)) ∈ Lk-1
          – If (l1 = {j1, …, jk-2, x} and l2 = {j1, …, jk-2, y} and x ≠ y)
            ● l ← {j1, …, jk-2, x, y}
            ● count(l) ← |{Ti : l ⊆ Ti}|
            ● If count(l) ≥ MinSupport × m
              – Lk ← Lk ∪ {(l, count(l))}
      – k ← k + 1
    ● Return L1 ∪ L2 ∪ … ∪ Lk-1
    (m: total number of transactions)

  • Illustration: Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  Risk Level
    1         Bad             High        None        Low           High
    2         Unknown         High        None        Medium        High
    3         Unknown         Low         None        Medium        Moderate
    4         Unknown         Low         None        Low           High
    5         Unknown         Low         None        High          Low
    6         Unknown         Low         Adequate    High          Low
    7         Bad             Low         None        Low           High
    8         Bad             Low         Adequate    High          Moderate
    9         Good            Low         None        High          Low
    10        Good            High        Adequate    High          Low
    11        Good            High        None        Low           High
    12        Good            High        None        Medium        Moderate
    13        Good            High        None        High          Low
    14        Bad             High        None        Medium        High

  • Running Apriori (I)

    ● Items:– (CH=Bad, .29) (CH=Unknown, .36) (CH=Good, .36)– (DL=Low, .5) (DL=High, .5)– (C=None, .79) (C=Adequate, .21)– (IL=Low, .29) (IL=Medium, .29) (IL=High, .43)– (RL=High, .43) (RL=Moderate, .21) (RL=Low, .36)

    ● Choose MinSupport=.4 and MinConfidence=.8

  • Running Apriori (II)

    ● L1 = {(DL=Low, .5); (DL=High, .5); (C=None, .79); (IL=High, .43); (RL=High, .43)}

    ● L2 = {(DL=High + C=None, .43)}

    ● L3 = {}

  • Running Apriori (III)
    ● Two possible rules:
      – DL = High ⇒ C = None (A)
      – C = None ⇒ DL = High (B)
    ● Confidences:
      – Conf(A) = .86 ≥ MinConfidence → retain
      – Conf(B) = .54 < MinConfidence → ignore

  • Discussion
    ● A "true" data mining algorithm
    ● Despite its popularity, real reported applications are few
    ● Easy to implement with a sparse matrix and simple sums
    ● Computationally expensive (actual run-time depends on MinSupport and, in the worst case, the time complexity is O(2^n))
    ● Not strictly an association learner (it induces rules, which are inherently unidirectional; alternatives exist, e.g., GRI)
    ● Extendible to sequence learning