Introduction to Data Mining Modelling Algorithms

Source: dml.cs.byu.edu/~cgc/docs/atdm/Readings/DMAlgOverview.pdf (2008-02-25)


  • Introduction to Data Mining Modelling Algorithms

  • Aim and Scope
    ● Brief introduction to the main concepts and modelling algorithms for DM applications
    ● Enough background to use existing commercial/freeware tools on simple tasks
    ● Not exhaustive: only representatives of each class of models are presented

  • Data Mining Approaches
    ● Predictive Modeling
      – Decision trees: ID3
      – Instance-based: k-NN
      – Probabilistic: NB, Linear/Logistic Regression
      – Connectionist: Backpropagation
    ● Descriptive Modeling
      – Summarization
      – Clustering: k-Means, COBWEB, Competitive Learning
      – Association Rules: Apriori
    ● Genetic Algorithms

  • Decision Tree
    ● Structure:
      – Internal nodes: tests on some property
      – Branches: values of the associated property
      – Leaf nodes: classifications
    ● Classification: traverse the tree from the root to a leaf
    ● Goal: construct a decision tree that allows the classification of objects from examples
      – Occam's Razor: prefer the smallest tree consistent with the examples

  • ID3 Algorithm
    ● Function ID3(Example-set, Properties)
      – If all elements in Example-set are in the same class, then return a leaf node labeled with that class
      – Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
      – Else
        ● Select P from Properties so as to maximize information gain
        ● Remove P from Properties
        ● Make P the root of the current tree
        ● For each value V of P
          – Create a branch of the current tree labeled by V
          – Partition_V ← elements of Example-set with value V for P
          – ID3(Partition_V, Properties)
          – Attach the result to branch V
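As a concrete illustration, here is a minimal Python sketch of the recursion above. It assumes a helper information_gain (a sketch of it follows the Information Gain slide) and examples represented as dicts mapping property names to values plus a "class" key; these representation choices are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def majority_class(examples):
    """Most common class label among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def id3(examples, properties):
    """Recursive ID3 sketch: returns a class label (leaf) or a tuple
    (property, {value: subtree}) for an internal node."""
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples in the same class
        return classes.pop()
    if not properties:                         # no properties left: majority class
        return majority_class(examples)
    # select the property that maximizes information gain (helper sketched below)
    best = max(properties, key=lambda p: information_gain(examples, p))
    remaining = [p for p in properties if p != best]
    branches = {}
    for value in {e[best] for e in examples}:  # one branch per observed value
        partition = [e for e in examples if e[best] == value]
        branches[value] = id3(partition, remaining)
    return (best, branches)
```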

  • Information Gain
    ● Entropy:
      – Let S be a set of examples from c classes:
        $\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
        where $p_i$ is the proportion of examples of S belonging to class i (note: we define $0 \log 0 = 0$)
      – The smaller the entropy, the "purer" the partition
    ● Information Gain:
      – Let p be a property with n outcomes; partitioning S based on p results in a gain of:
        $\mathrm{Gain}(S, p) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$
        where $S_i$ is the subset of S for which property p has its i-th value
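A minimal sketch of these two formulas in Python, using the same dict-of-attributes representation assumed in the ID3 sketch above:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy(S) = sum_i -p_i * log2(p_i) over the class proportions."""
    n = len(examples)
    counts = Counter(e["class"] for e in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, prop):
    """Gain(S, p) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)."""
    n = len(examples)
    remainder = 0.0
    for value in {e[prop] for e in examples}:
        subset = [e for e in examples if e[prop] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder
```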

  • Illustrative Training Set: Risk Assessment for Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  RISK LEVEL
    1         Bad             High        None        Low           HIGH
    2         Unknown         High        None        Medium        HIGH
    3         Unknown         Low         None        Medium        MODERATE
    4         Unknown         Low         None        Low           HIGH
    5         Unknown         Low         None        High          LOW
    6         Unknown         Low         Adequate    High          LOW
    7         Bad             Low         None        Low           HIGH
    8         Bad             Low         Adequate    High          MODERATE
    9         Good            Low         None        High          LOW
    10        Good            High        Adequate    High          LOW
    11        Good            High        None        Low           HIGH
    12        Good            High        None        Medium        MODERATE
    13        Good            High        None        High          LOW
    14        Bad             High        None        Medium        HIGH

  • ID3 Example (I)

    1) Choose Income as the root of the tree:
       – Income = Low → clients 1, 4, 7, 11 (step 2)
       – Income = Medium → clients 2, 3, 12, 14 (step 3)
       – Income = High → clients 5, 6, 8, 9, 10, 13 (step 4)

    2) Income = Low: all examples are in the same class, HIGH. Return a leaf node.

    3) Income = Medium: choose Debt Level as the root of the subtree:
       – Debt Level = Low → client 3 (step 3a)
       – Debt Level = High → clients 2, 12, 14 (step 3b)

    3a) Debt Level = Low: all examples are in the same class, MODERATE. Return a leaf node.

  • ID3 Example (II)

    3b) Choose Credit History as the root of the subtree for Debt Level = High (clients 2, 12, 14).

    3b'-3b''') For each value of Credit History, all examples are in the same class. Return leaf nodes:
        – Good (client 12) → MODERATE
        – Bad (client 14) → HIGH
        – Unknown (client 2) → HIGH

    4) Choose Credit History as the root of the subtree for Income = High (clients 5, 6, 8, 9, 10, 13).

    4a-4c) For each value of Credit History, all examples are in the same class. Return leaf nodes:
        – Good (clients 9, 10, 13) → LOW
        – Bad (client 8) → MODERATE
        – Unknown (clients 5, 6) → LOW

  • ID3 Example (III): attach the subtrees at the appropriate places. The final tree:

    Income
      = Low    → HIGH
      = Medium → Debt Level
                   = Low  → MODERATE
                   = High → Credit History
                              = Good    → MODERATE
                              = Bad     → HIGH
                              = Unknown → HIGH
      = High   → Credit History
                   = Good    → LOW
                   = Bad     → MODERATE
                   = Unknown → LOW

  • Discussion

    ● ID3 handles only discrete attributes: extensions to numerical attributes have been proposed, the most famous being C4.5/C5.0

    ● Experience shows that decision trees tend to produce very good results on many problems

    ● Trees are most attractive when end users want interpretable knowledge from their data

    ● Overfitting likely: use pruning

  • Instance-based
    ● Does not build an explicit model
    ● Training data is stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify the new query instance
    ● Lazy learning: simply compute the classification of each new query instance as needed

  • k-NN Algorithm
    ● For each training instance t = (x, f(x))
      – Add t to the set Tr_instances
    ● Given a query instance q to be classified
      – Let x_1, …, x_k be the k instances in Tr_instances nearest to q
      – Return
        $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
        where V is the finite set of target class values, and δ(a, b) = 1 if a = b, and 0 otherwise
    ● Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors
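A minimal Python sketch of this procedure for numeric feature vectors, using Euclidean distance; the function names and data layout are assumptions for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    """training: list of (vector, label) pairs; returns the majority
    label among the k training instances nearest to the query."""
    neighbours = sorted(training, key=lambda t: euclidean(t[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# usage: changing k (e.g., 1 vs 5) can flip the prediction, as in the illustration that follows
print(knn_classify([((0, 0), "+"), ((1, 1), "-"), ((2, 2), "-")], (0.4, 0.4), k=1))
```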

  • Illustration
    (figure: a query point q surrounded by + and – training examples)
    q is + under 1-NN, but – under 5-NN

  • Discussion
    ● Works well on many practical problems and is fairly noise tolerant (depending on k)
    ● Some issues:
      – Irrelevant attributes dominate the distance function: use a weighted distance function or remove irrelevant attributes
      – Long query processing: implement indexing schemes
      – Large memory requirement: implement instance reduction strategies
      – Non-Euclidean/heterogeneous spaces: extend/design adequate distance measures

  • Probabilistic Learning

    ● We are often interested in determining the best hypothesis from some space H, given the observed training data D.

    ● One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.

  • Remark
    ● Instead of finding the most probable hypothesis given the training data, it is often the following related question that is most relevant:
      – Which is the most probable classification of the new query instance given the training data?
    ● In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

  • Bayes Optimal Classification
    ● If the possible classification of the new instance can take on any value v_j from some set V, then the probability P(v_j|D) that the correct classification for the new instance is v_j is given by:
      $P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
    ● The optimal classification of the new instance is then:
      $\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
    ● Nice, but impractical for large hypothesis spaces

  • Naïve Bayes Learning
    ● A practical Bayesian learning method
    ● Applies when instances are conjunctions of attribute values and the target takes its values from some finite set V
    ● Consists in assigning to a new query instance the most probable target value, v_MAP, given the attribute values a_1, …, a_n that describe the instance, i.e.,
      $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$

  • NB Algorithm
    ● Using Bayes' theorem, this can be reformulated as:
      $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)}$
    ● Making the further simplifying assumption that attribute values are conditionally independent given the target value, one can write the conjunctive conditional probability as a product of simple conditional probabilities, producing the algorithm:
      Return $\arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$

  • Illustrative Training Set: Risk Assessment for Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  RISK LEVEL
    1         Bad             High        None        Low           HIGH
    2         Unknown         High        None        Medium        HIGH
    3         Unknown         Low         None        Medium        MODERATE
    4         Unknown         Low         None        Low           HIGH
    5         Unknown         Low         None        High          LOW
    6         Unknown         Low         Adequate    High          LOW
    7         Bad             Low         None        Low           HIGH
    8         Bad             Low         Adequate    High          MODERATE
    9         Good            Low         None        High          LOW
    10        Good            High        Adequate    High          LOW
    11        Good            High        None        Low           HIGH
    12        Good            High        None        Medium        MODERATE
    13        Good            High        None        High          LOW
    14        Bad             High        None        Medium        HIGH

  • NB Example

    Class counts and priors (over the 14 clients):
      Risk = High: 6 (0.43)   Risk = Moderate: 3 (0.21)   Risk = Low: 5 (0.36)

    Conditional probabilities P(attribute value | Risk Level):

    Credit History   High   Moderate   Low
    Unknown          0.33   0.33       0.40
    Bad              0.50   0.33       0.00
    Good             0.17   0.33       0.60

    Debt Level       High   Moderate   Low
    High             0.67   0.33       0.40
    Low              0.33   0.67       0.60

    Collateral       High   Moderate   Low
    None             1.00   0.67       0.60
    Adequate         0.00   0.33       0.40

    Income Level     High   Moderate   Low
    High             0.00   0.33       1.00
    Medium           0.33   0.67       0.00
    Low              0.67   0.00       0.00

    Query instance (Bad, Low, Adequate, Medium):
      High 0.00%   Moderate 1.06%   Low 0.00%   →   Prediction: Moderate

    Query instance (Bad, High, None, Low), already seen (client 1):
      High 9.52%   Moderate 0.00%   Low 0.00%   →   Prediction: High
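A minimal Python sketch of the NB algorithm applied to this training set; the data encoding and function names are assumptions, and the computed scores should match the percentages above up to rounding:

```python
from collections import Counter

# (Credit History, Debt Level, Collateral, Income Level) -> Risk Level
data = [
    (("Bad", "High", "None", "Low"), "High"),
    (("Unknown", "High", "None", "Medium"), "High"),
    (("Unknown", "Low", "None", "Medium"), "Moderate"),
    (("Unknown", "Low", "None", "Low"), "High"),
    (("Unknown", "Low", "None", "High"), "Low"),
    (("Unknown", "Low", "Adequate", "High"), "Low"),
    (("Bad", "Low", "None", "Low"), "High"),
    (("Bad", "Low", "Adequate", "High"), "Moderate"),
    (("Good", "Low", "None", "High"), "Low"),
    (("Good", "High", "Adequate", "High"), "Low"),
    (("Good", "High", "None", "Low"), "High"),
    (("Good", "High", "None", "Medium"), "Moderate"),
    (("Good", "High", "None", "High"), "Low"),
    (("Bad", "High", "None", "Medium"), "High"),
]

def nb_classify(data, query):
    """Return ({class: P(v_j) * prod_i P(a_i | v_j)}, arg max class)."""
    class_counts = Counter(label for _, label in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)                      # prior P(v_j)
        for i, value in enumerate(query):                # product of P(a_i | v_j)
            n_match = sum(1 for x, y in data if y == label and x[i] == value)
            score *= n_match / n_label
        scores[label] = score
    return scores, max(scores, key=scores.get)

scores, pred = nb_classify(data, ("Bad", "Low", "Adequate", "Medium"))
print(pred, {k: round(v, 4) for k, v in scores.items()})
# Moderate {'High': 0.0, 'Moderate': 0.0106, 'Low': 0.0}
```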

  • Discussion
    ● Whenever the assumption of conditional independence is satisfied, NB classification is optimal
    ● NB is inherently incremental
    ● NB estimates probabilities by frequencies
      – Assume P(X=x|Y=y) = 0.05 and the training set is such that n_y = 5. Then it is highly probable that n_{x|y} = 0. The fraction is thus an underestimate of the probability, and it will "dominate" the NB classifier for all new queries with X=x.
      – Use the m-estimate: replace n_{x|y}/n_y by (n_{x|y} + mp)/(n_y + m), where p is our prior estimate of the probability we wish to determine and m is a constant (typically, p = 1/(number of values of X), and m acts as a weight, similar to adding m virtual instances distributed according to p)
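For instance, the frequency ratio inside the NB sketch above could be smoothed with the m-estimate as follows (a small assumed helper, not from the slides):

```python
def m_estimate(n_xy, n_y, num_values, m=1.0):
    """(n_xy + m*p) / (n_y + m) with the uniform prior p = 1/num_values."""
    p = 1.0 / num_values
    return (n_xy + m * p) / (n_y + m)
```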

  • Linear Regression
    ● Fit a linear model to data where the dependent variable is continuous:
      $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + e$
    ● Given a set of points (X_i, Y_i), we wish to find a linear function (or line in 2 dimensions) that "goes through" these points.
    ● In general, the points are not exactly aligned:
      – Find the line that best fits the points

  • Sum-squared Error (SSE)
    ● $SSE = \sum_{y} (y_{observed} - y_{predicted})^2$
    ● $TSS = \sum_{y} (y_{observed} - \bar{y}_{observed})^2$
    ● $R^2 = 1 - \frac{SSE}{TSS}$
    ● The smaller the SSE, the better the fit
    ● Hence, linear regression attempts to minimize SSE, or equivalently to maximize $R^2$

  • Analytical Solution
    ● In 2 dimensions:
      $\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$
      $\beta_0 = \frac{\sum y - \beta_1 \sum x}{n}$
    ● In n > 2 dimensions: matrix arithmetic

  • Illustration (target: y = 2x + 1.5)

    x     y      x^2    xy
    1.20  4.00   1.44   4.80
    2.30  5.60   5.29   12.88
    3.10  7.90   9.61   24.49
    3.40  8.00   11.56  27.20
    4.00  10.10  16.00  40.40
    4.60  10.40  21.16  47.84
    5.50  12.00  30.25  66.00
    Σ     24.10  58.00  95.31  223.61

    x     y (obs)  y (pred)  SSE    TSS
    1.20  4.00     3.94      0.004  18.940
    2.30  5.60     6.07      0.223  4.920
    3.10  7.90     7.62      0.076  0.444
    3.40  8.00     8.21      0.042  0.007
    4.00  10.10    9.37      0.533  1.166
    4.60  10.40    10.53     0.018  5.036
    5.50  12.00    12.28     0.078  15.920
                             0.974  46.432

    β1 = 1.94
    β0 = 1.61
    R2 = 0.98
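A quick sketch of the two-dimensional closed-form solution on this data (variable names are assumptions); it should recover roughly β1 ≈ 1.94, β0 ≈ 1.61, R² ≈ 0.98:

```python
xs = [1.20, 2.30, 3.10, 3.40, 4.00, 4.60, 5.50]
ys = [4.00, 5.60, 7.90, 8.00, 10.10, 10.40, 12.00]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# closed-form estimates for the 2-dimensional case
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

preds = [b0 + b1 * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
tss = sum((y - sum_y / n) ** 2 for y in ys)
r2 = 1 - sse / tss

print(round(b1, 2), round(b0, 2), round(r2, 2))   # ~ 1.94 1.61 0.98
```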

  • Logistic Regression
    ● Fit a curve to data in which the dependent variable is binary, or dichotomous
    (figure: scatter plot of NewOut against SurvRate with the fitted regression curve)
    ● Regression curve: a sigmoid function (bounded by the asymptotes y=0 and y=1)
    ● Observations: for each value of SurvRate, the number of dots is the number of patients with that value of NewOut

  • Solution
    ● Given some event with probability p of being 1, the odds of that event are given by:
      odds = p / (1 − p)
    ● The logit is the natural log of the odds:
      logit(p) = ln(odds) = ln(p / (1 − p))
    ● In logistic regression, we seek a model (2-dimensional here):
      $\mathrm{logit}(p) = \beta_0 + \beta_1 X$
    ● That is, the logit is assumed to be linearly related to the independent variables, and we solve an ordinary (linear) regression!

  • Illustration (I)

    Age Group     Coronary Heart Disease     Total
                  No       Yes
    1 (20-29)     9        1                 10
    2 (30-34)     13       2                 15
    3 (35-39)     9        3                 12
    4 (40-44)     10       5                 15
    5 (45-49)     7        6                 13
    6 (50-54)     3        5                 8
    7 (55-59)     4        13                17
    8 (60-69)     2        8                 10
    Total         57       43                100

  • Illustration (II)

    Age Group   p(CHD)=1   odds     log odds   #occ
    1           0.1000     0.1111   -2.1972    10
    2           0.1333     0.1538   -1.8718    15
    3           0.2500     0.3333   -1.0986    12
    4           0.3333     0.5000   -0.6931    15
    5           0.4615     0.8571   -0.1542    13
    6           0.6250     1.6667   0.5108     8
    7           0.7647     3.2500   1.1787     17
    8           0.8000     4.0000   1.3863     10

  • Illustration (III)

    X (AG)   Y (log odds)   X^2         XY          #occ
    1        -2.1972        1.0000      -2.1972     10
    2        -1.8718        4.0000      -3.7436     15
    3        -1.0986        9.0000      -3.2958     12
    4        -0.6931        16.0000     -2.7726     15
    5        -0.1542        25.0000     -0.7708     13
    6        0.5108         36.0000     3.0650      8
    7        1.1787         49.0000     8.2506      17
    8        1.3863         64.0000     11.0904     10
    Σ        448            -37.6471    2504.0000   106.3981    100

    Note: the sums reflect the number of occurrences
    (Sum(X) = X1·#occ(X1) + … + X8·#occ(X8), etc.)

  • Illustration (IV)
    ● Results from the regression:
      – β0 = -2.856 and β1 = 0.5535

    Age Group   p(CHD)=1   est. p
    1           0.1000     0.0909
    2           0.1333     0.1482
    3           0.2500     0.2323
    4           0.3333     0.3448
    5           0.4615     0.4778
    6           0.6250     0.6142
    7           0.7647     0.7346
    8           0.8000     0.8280

    SSE = 0.0028   TSS = 0.5265   R2 = 0.9946

  • Recovering Probabilities
    ● $\ln \frac{p}{1-p} = \beta_0 + \beta_1 X$
    ● $\frac{p}{1-p} = e^{\beta_0 + \beta_1 X}$
    ● $p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
    ● which gives p as a sigmoid function!
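A minimal sketch of this recipe on the CHD data above (fit an ordinary regression on the occurrence-weighted log odds, then recover the probabilities with the sigmoid); variable names are assumptions, and the result should be close to β0 ≈ -2.856, β1 ≈ 0.5535 and the tabulated estimated probabilities:

```python
import math

# age group, observed log odds of CHD, number of occurrences (from the tables above)
groups = [(1, -2.1972, 10), (2, -1.8718, 15), (3, -1.0986, 12), (4, -0.6931, 15),
          (5, -0.1542, 13), (6, 0.5108, 8), (7, 1.1787, 17), (8, 1.3863, 10)]

# weighted sums: each group counts once per occurrence
n = sum(occ for _, _, occ in groups)
sum_x = sum(x * occ for x, _, occ in groups)
sum_y = sum(y * occ for _, y, occ in groups)
sum_x2 = sum(x * x * occ for x, _, occ in groups)
sum_xy = sum(x * y * occ for x, y, occ in groups)

# ordinary least squares on the logit
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n
print(round(b0, 3), round(b1, 4))          # ~ -2.856 0.5535

# recover estimated probabilities with the sigmoid
for x, _, _ in groups:
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    print(x, round(p, 4))                  # ~ 0.0909, 0.1482, ... as tabulated
```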

  • Discussion
    ● Regression is a powerful data mining technique
      – It provides prediction
      – It offers insight into the relative power of each independent variable
    ● Technique of choice in medicine and the social sciences

  • Multi-layer Feed-forward NN
    (figure: fully connected feed-forward network with input units i, hidden units j, and output units k)

  • Training Error
    ● We define the training error of a hypothesis, or weight vector, over a set of data D, by:
      $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
    ● which we will seek to minimize

  • Non-linear Activation
    ● Introduce non-linearity with the sigmoid function:
      $net = \sum_{i=1}^{n} w_i x_i$
      $out = \frac{1}{1 + e^{-net}}$
      $\frac{d\,out}{d\,net} = out \cdot (1 - out)$
    ● Differentiable (required for gradient descent)
    ● Most unstable in the middle

  • Backpropagation
    ● Initialize all weights randomly
    ● Repeat
      – Present a training instance
      – Compute the error δ_k of the output units
      – For each hidden layer
        ● Compute the error δ_j using the error from the next layer
      – Update all weights: w_ij ← w_ij + Δw_ij (with Δw_ij = η O_i δ_j)
    ● Until (E < CriticalError)

  • Network Equations
    ● $\Delta w_{ij} = \eta\, O_i\, \delta_j$
    ● Output units: $\delta_j = (T_j - O_j)\, f'(net_j)$
    ● Hidden units: $\delta_j = \left(\sum_k \delta_k w_{jk}\right) f'(net_j)$
    ● $f'(net_j) = O_j (1 - O_j)$

  • Illustration (I)
    ● Consider a simple network composed of:
      – 3 inputs: a, b, c
      – 1 hidden node: h
      – 2 outputs: q, r
    ● Assume η = 0.5, all weights are initialized to 0.2, and weight updates are incremental
    ● Consider the training set (inputs a b c → targets q r):
      – 1 0 1 → 0 1
      – 0 1 1 → 1 1
    ● 4 iterations over the training set

  • Illustration (II)

    a b c    W_a-h W_b-h W_c-h   h      W_h-q W_h-r   Out_q Out_r   T_q T_r   δ_q    ΔW_h-q   δ_r    ΔW_h-r   δ_h    ΔW_a-h
    (init)   0.2   0.2   0.2            0.2   0.2
    1 0 1    0.2   0.2   0.2     0.6    0.2   0.2     0.53  0.53    0   1     -0.13  -0.04    0.12   0.04     0      0
    (upd)    0.2   0.2   0.2            0.16  0.24
    0 1 1    0.2   0.2   0.2     0.6    0.16  0.24    0.52  0.54    1   1     0.12   0.04     0.12   0.03     0.01   0
    (upd)    0.2   0.21  0.21           0.2   0.27
    1 0 1    0.2   0.21  0.21    0.6    0.2   0.27    0.53  0.54    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.21  0.21           0.16  0.3
    0 1 1    0.2   0.21  0.21    0.6    0.16  0.3     0.52  0.55    1   1     0.12   0.04     0.11   0.03     0.01   0
    (upd)    0.2   0.21  0.21           0.19  0.34
    1 0 1    0.2   0.21  0.21    0.6    0.19  0.34    0.53  0.55    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.21  0.21           0.15  0.37
    0 1 1    0.2   0.21  0.21    0.6    0.15  0.37    0.52  0.56    1   1     0.12   0.04     0.11   0.03     0.01   0
    (upd)    0.2   0.22  0.22           0.19  0.4
    1 0 1    0.2   0.22  0.22    0.6    0.19  0.4     0.53  0.56    0   1     -0.13  -0.04    0.11   0.03     0      0
    (upd)    0.2   0.22  0.22           0.15  0.44
    0 1 1    0.2   0.22  0.22    0.61   0.15  0.44    0.52  0.57    1   1     0.12   0.04     0.11   0.03     0.02   0
    (upd)    0.2   0.23  0.23           0.19  0.47

    ((init) = initialization; (upd) = update weights)
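The trace above can be reproduced (up to rounding) with a short incremental-backpropagation sketch for this 3-1-2 network; the code layout is an assumption, but it follows the update equations two slides back:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 3 inputs -> 1 hidden node h -> 2 outputs (q, r); all weights start at 0.2
w_in = [0.2, 0.2, 0.2]      # W_a-h, W_b-h, W_c-h
w_out = [0.2, 0.2]          # W_h-q, W_h-r
eta = 0.5
training = [([1, 0, 1], [0, 1]), ([0, 1, 1], [1, 1])]

for epoch in range(4):                       # 4 passes over the training set
    for x, targets in training:
        h = sigmoid(sum(wi * xi for wi, xi in zip(w_in, x)))
        outs = [sigmoid(w * h) for w in w_out]
        # output-unit errors: delta = (T - O) * O * (1 - O)
        d_out = [(t - o) * o * (1 - o) for t, o in zip(targets, outs)]
        # hidden-unit error: delta_h = (sum_k delta_k * w_hk) * h * (1 - h)
        d_h = sum(d * w for d, w in zip(d_out, w_out)) * h * (1 - h)
        # incremental updates: w += eta * O_source * delta_target
        w_out = [w + eta * h * d for w, d in zip(w_out, d_out)]
        w_in = [w + eta * xi * d_h for w, xi in zip(w_in, x)]
        # rounded values should match the "update weights" rows above
        print([round(v, 2) for v in w_in + w_out])
```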

  • Discussion
    ● 3-layer BPNNs are universal function approximators
    ● Require many passes over the data
    ● Potential for massive parallelism
    ● Convergence to the global minimum is not guaranteed
      – Use a momentum term:
        $\Delta w_{ij}(n) = \eta\, O_i\, \delta_j + \alpha\, \Delta w_{ij}(n-1)$
        ● Keeps moving through small local (global!) minima or along flat regions
      – Use the incremental/stochastic version of the algorithm
      – Train multiple networks with different starting weights
        ● Select the best on a hold-out validation set
        ● Combine outputs (e.g., weighted average)

  • Check Your Understanding

    Dataset (one instance per line: age, spectacle prescription, astigmatic, tear production rate, class):
      young,myope,no,reduced,none
      young,myope,no,normal,soft
      young,myope,yes,reduced,none
      young,myope,yes,normal,hard
      young,hypermetrope,no,reduced,none
      young,hypermetrope,no,normal,soft
      young,hypermetrope,yes,reduced,none
      young,hypermetrope,yes,normal,hard
      pre-presbyopic,myope,no,reduced,none
      pre-presbyopic,myope,no,normal,soft
      pre-presbyopic,myope,yes,reduced,none
      pre-presbyopic,myope,yes,normal,hard
      pre-presbyopic,hypermetrope,no,reduced,none
      pre-presbyopic,hypermetrope,no,normal,soft
      pre-presbyopic,hypermetrope,yes,reduced,none
      pre-presbyopic,hypermetrope,yes,normal,none
      presbyopic,myope,no,reduced,none
      presbyopic,myope,no,normal,none
      presbyopic,myope,yes,reduced,none
      presbyopic,myope,yes,normal,hard
      presbyopic,hypermetrope,no,reduced,none
      presbyopic,hypermetrope,no,normal,soft
      presbyopic,hypermetrope,yes,reduced,none
      presbyopic,hypermetrope,yes,normal,none

    Attribute Information:
      1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
      2. spectacle prescription: (1) myope, (2) hypermetrope
      3. astigmatic: (1) no, (2) yes
      4. tear production rate: (1) reduced, (2) normal
    Class Distribution:
      1. hard contact lenses: 4
      2. soft contact lenses: 5
      3. no contact lenses: 15

    Apply all of the above predictive modelling algorithms to this dataset

  • Summarization
    ● A compact description for a subset of the data
    ● Retrospective analysis
    ● Techniques:
      – Statistics
      – Information theory
      – Online analytical processing (OLAP)

  • Illustration
    ● Average downtime of all plant equipment in a given month
    ● Total income generated by each sales representative per region per year
    ● Proportion of each type of surgical procedure performed, by gender and ethnicity

  • Estimating Means
    ● What is the maximum likelihood hypothesis for the mean of a single Normal distribution, given observed instances from it?
      – The hypothesis that minimizes the SSE, which in this case happens to be the sample mean:
        $\mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n} x_i$
    ● What if there are k means to estimate?

  • Using Hidden Variables
    ● We have k hidden variables
    ● Each training instance is extended from x_i to <x_i, z_i1, z_i2, …, z_ik>:
      – x_i is the observed instance
      – z_ij are the hidden variables
      – z_ij = 1 if x_i was generated by the j-th Gaussian, and 0 otherwise

  • k-Means Algorithm
    ● Initialization:
      – Set h = <µ1, …, µk>, where the µi's are arbitrary values
    ● Step 1:
      – Calculate C[z_ij] for each hidden variable z_ij, assuming the current hypothesis h: 1 if x_i is closest to µ_j, 0 otherwise
    ● Step 2:
      – Calculate a new maximum likelihood hypothesis h' = <µ1', …, µk'>, assuming the value taken on by each z_ij is C[z_ij] as calculated in Step 1: the mean of the x_i's for which C[z_ij] = 1
      – Replace h by h'
      – If the stopping condition is not met, go to Step 1

  • Intuitively
    1. Pick a number (k) of cluster centers (at random)
    2. Assign every item to its nearest cluster center (e.g., using Euclidean distance)
    3. Move each cluster center to the mean of its assigned items
    4. Repeat steps 2 and 3 until convergence (i.e., change in cluster assignments less than a threshold)
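A minimal Python sketch of these four steps for 2-D points (function and variable names are assumptions):

```python
import math, random

def kmeans(points, k, max_iter=100):
    """points: list of (x, y) tuples. Returns (centers, assignments)."""
    centers = random.sample(points, k)            # step 1: random initial centers
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center (Euclidean distance)
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:        # step 4: stop when nothing changes
            break
        assignments = new_assignments
        # step 3: move each center to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments
```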

  • Illustration (I)
    (figure: X-Y scatter plot) Pick 3 initial cluster centers k1, k2, k3 (randomly)

  • Illustration (II)
    (figure: X-Y scatter plot) Assign each point to the closest cluster center

  • Illustration (III)
    (figure: X-Y scatter plot) Move each cluster center to the mean of each cluster

  • Illustration (IV)
    (figure: X-Y scatter plot) Reassign points closest to a different new cluster center; three points have been "re-labelled"

  • Illustration (V)
    (figure: X-Y scatter plot) Re-compute the cluster means

  • Illustration (VI)
    (figure: X-Y scatter plot) Move the cluster centers to the cluster means; no more changes: done!

  • Discussion
    ● Simple
    ● Items are automatically assigned to clusters
    ● Requires some distance function
    ● Must pick the number of clusters beforehand
    ● Result can vary significantly depending on the initial choice of seeds
      – To increase the chances of finding the global optimum: restart with different random seeds
    ● Sensitive to outliers

  • COBWEB Overview
    ● Symbolic approach to category formation
    ● Uses a global quality metric to determine the number of clusters, the depth of the hierarchy, and the category membership of new instances
    ● Categories are probabilistic
    ● The algorithm is incremental

  • Predictability
    ● P(F_i = v_ij | C_k) is called predictability
      – The probability that an object has value v_ij for feature F_i, given that the object belongs to category C_k
      – The greater this probability, the more likely two objects in a category share the same features

  • Predictiveness
    ● P(C_k | F_i = v_ij) is called predictiveness
      – The probability with which an object belongs to category C_k, given that it has value v_ij for feature F_i
      – The greater this probability, the less likely objects not in the category will have those feature values

  • Category Utility
    ● P(F_i = v_ij) serves as a weight
      – Ensures that frequently occurring feature values exert a stronger influence on the evaluation
    ● CU maximizes the potential for inferring information while maximizing intra-class similarity and inter-class differences:
      $CU = \sum_{k}\sum_{j}\sum_{i} P(F_i = v_{ij} \mid C_k)\, P(C_k \mid F_i = v_{ij})\, P(F_i = v_{ij})$

  • Tree Representation
    ● Each node stores:
      – Its probability of occurrence, P(Ck) (= number of instances at the node / total number of instances)
      – All possible values of every feature observed in the instances and, for each such value, its predictability
      – Predictiveness, computed using Bayes' rule
    ● Leaf nodes correspond to observed instances
    ● All links are "is-a" links

  • COBWEB Algorithm
    ● Function Cobweb(Node, Inst)
      – If Node is a leaf
        ● Create 2 children, L1 and L2, of Node
        ● Set the probabilities of L1 to those of Node
        ● Initialize the probabilities of L2 to those of Inst
        ● Add Inst to Node, updating Node's probabilities
      – Else
        ● Add Inst to Node, updating Node's probabilities
        ● For each child C of Node, compute the CU of the taxonomy obtained by placing Inst in C
        ● Compute
          – S1 = score of the best categorization C1
          – S2 = score of the next best categorization C2
          – S3 = score of placing Inst in a new category
          – S4 = score of merging C1 and C2
          – S5 = score of splitting C1
        ● Case:
          – S1 is the best score: Cobweb(C1, Inst)
          – S3 is the best score: initialize the new category's probabilities to those of Inst
          – S4 is the best score: let Cm be the result of merging C1 and C2; Cobweb(Cm, Inst)
          – S5 is the best score: split C1; Cobweb(Node, Inst)
          – Default: Cobweb(C2, Inst)

  • Illustration (I)
    ● Assume 2 attributes only
      – Color: Blue, Red
      – Shape: Triangle, Square
    ● Examples:
      – Blue Triangle
      – Blue Square
      – Red Triangle
      – Blue Triangle

  • Illustration (II) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (III) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (IV) (figure: resulting COBWEB tree; not preserved in this transcription)

  • Illustration (V)
    ● Merge C1 and C2
      – Same as having only the root node: CU = 10/9
    ● Split C1
      – Same as creating a new category: CU = 2
    ● Hence:
      – S1 = 5/3
      – S2 = 4/3
      – S3 = 2
      – S4 = 10/9
      – S5 = 2

  • Illustration (VI)
    ● Highest score is S3
      – Create a new category
    ● Repeat the process for the next example
      – Blue Triangle
      – Results:
        ● S1 = 2
        ● S2 = 7/4
        ● S3 = 2
        ● S4 = 7/6
        ● S5 = 2
      – Highest score is S1 (as expected)

  • Illustration (VII)
    ● Emerging clustering:
      – C1: Blue Triangle
      – C2: Blue Square
      – C3: Red Triangle
    ● Seems reasonable (?)

    http://www-ai.cs.uni-dortmund.de/kdnet/auto?self=$81d91eaae317b2bebb

  • Discussion
    ● Nice probabilistic model with no parameters set a priori
    ● Only handles nominal features (CLASSIT extends it to numerical features)
    ● Sensitive to the order of presentation of instances
    ● Retains each instance, which may cause problems with noisy data

  • CL Overview
    ● Sub-symbolic approach
    ● Weight vectors serve as prototypes
      – The more similar the input vector is to the weight vector, the higher the net input
      – Weights "move about" until they sit in the "middle" of a cluster
    ● Simple model: binary inputs and a single-layer network
      – Output nodes implement a winner-take-all policy
      – Lateral inhibition is enforced between pairs of outputs
      – Every input is connected to every output with an adjustable, real-valued weight (sum of weights for each unit = 1)

  • CL Algorithm
    ● Let
      – n: number of inputs
      – x_i: value (0 or 1) of input i
      – w_ij: weight from input i to output unit j
      – a: number of inputs set to 1
      – g: learning rate
    ● Initialize weights randomly
    ● For each training instance
      – Compute $net_j = \sum_{i=1}^{n} w_{ij}\, x_i$ for each output unit j
      – Update $\Delta w_{ij} = g\left(\frac{x_i}{a} - w_{ij}\right)$ if $j = \arg\max_k net_k$, and 0 otherwise
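A minimal Python sketch of this update rule; run with the same (non-random) initial weights and g = 0.4 as the illustration that follows, it should reproduce the weight traces shown there:

```python
def competitive_learning(weights, examples, g=0.4, epochs=1):
    """weights: list of weight vectors, one per output unit.
    examples: list of binary input vectors. Updates the winner's weights
    with delta_w_ij = g * (x_i / a - w_ij)."""
    for _ in range(epochs):
        for x in examples:
            a = sum(x)                                   # number of inputs set to 1
            nets = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
            j = nets.index(max(nets))                    # winner-take-all
            weights[j] = [w_i + g * (x_i / a - w_i) for w_i, x_i in zip(weights[j], x)]
            print(j, [round(w_i, 2) for w_i in weights[j]])
    return weights

# initial prototypes C1, C2 and training set from the illustration below
w = [[0.1, 0.3, 0.1, 0.2], [0.2, 0.1, 0.4, 0.1]]
competitive_learning(w, [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [1, 0, 0, 1]], epochs=3)
```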

  • Illustration (I)
    ● Assume:
      – 4 Boolean inputs
      – 2 clusters, with initial weight vectors:
        ● C1: .1 .3 .1 .2
        ● C2: .2 .1 .4 .1
      – g = 0.4
    ● Training examples:
      – 1100
      – 0011
      – 1110
      – 1001
    ● Note: weights do not sum to 1 here

  • Illustration (II)
    ● 1100
      – C1: 1x0.1 + 1x0.3 + 0x0.1 + 0x0.2 = 0.4
      – C2: 1x0.2 + 1x0.1 + 0x0.4 + 0x0.1 = 0.3
      – C1 wins; updated weights:
        ● C1: .26 .38 .06 .12
        ● C2: .2 .1 .4 .1
    ● 0011
      – C1: 0x0.26 + 0x0.38 + 1x0.06 + 1x0.12 = 0.18
      – C2: 0x0.2 + 0x0.1 + 1x0.4 + 1x0.1 = 0.5
      – C2 wins; updated weights:
        ● C1: .26 .38 .06 .12
        ● C2: .12 .06 .44 .26

  • Illustration (III)
    ● 1110
      – C1: 1x.26 + 1x.38 + 1x.06 + 0x.12 = 0.7
      – C2: 1x.12 + 1x.06 + 1x.44 + 0x.26 = 0.62
      – C1 wins; updated weights:
        ● C1: .29 .36 .17 .07
        ● C2: .12 .06 .44 .26
    ● 1001
      – C1: 1x.29 + 0x.36 + 0x.17 + 1x.07 = 0.36
      – C2: 1x.12 + 0x.06 + 0x.44 + 1x.26 = 0.38
      – C2 wins; updated weights:
        ● C1: .29 .36 .17 .07
        ● C2: .27 .04 .26 .36

  • Illustration (IV)
    ● 1100
      – C1: 1x0.29 + 1x0.36 + 0x0.17 + 0x0.07 = 0.65
      – C2: 1x0.27 + 1x0.04 + 0x0.26 + 0x0.36 = 0.31
      – C1 wins; updated weights:
        ● C1: .37 .42 .1 .04
        ● C2: .27 .04 .26 .36
    ● 0011
      – C1: 0x0.37 + 0x0.42 + 1x0.1 + 1x0.04 = 0.14
      – C2: 0x0.27 + 0x0.04 + 1x0.26 + 1x0.36 = 0.62
      – C2 wins; updated weights:
        ● C1: .37 .42 .1 .04
        ● C2: .16 .02 .36 .42

  • Illustration (V)
    ● 1110
      – C1: 1x.37 + 1x.42 + 1x.1 + 0x.04 = 0.89
      – C2: 1x.16 + 1x.02 + 1x.36 + 0x.42 = 0.54
      – C1 wins; updated weights:
        ● C1: .36 .39 .19 .02
        ● C2: .16 .02 .36 .42
    ● 1001
      – C1: 1x.36 + 0x.39 + 0x.19 + 1x.02 = 0.38
      – C2: 1x.16 + 0x.02 + 0x.36 + 1x.42 = 0.58
      – C2 wins; updated weights:
        ● C1: .36 .39 .19 .02
        ● C2: .23 .15 .35 .25

  • Illustration (VI)
    ● 1100
      – C1: 1x0.36 + 1x0.39 + 0x0.19 + 0x0.02 = 0.75
      – C2: 1x0.23 + 1x0.15 + 0x0.35 + 0x0.25 = 0.38
      – C1 wins; updated weights:
        ● C1: .42 .43 .11 .01
        ● C2: .23 .15 .35 .25
    ● 0011
      – C1: 0x0.42 + 0x0.43 + 1x0.11 + 1x0.01 = 0.12
      – C2: 0x0.23 + 0x0.15 + 1x0.35 + 1x0.25 = 0.6
      – C2 wins; updated weights:
        ● C1: .42 .43 .11 .01
        ● C2: .14 .09 .41 .35

  • Illustration (VII)
    ● Emerging clustering:
      – C1: 1100, 1110
      – C2: 0011, 1001
    ● Seems reasonable (?)

    http://www.psychology.mcmaster.ca/4i03/demos/competitive1-demo.html

  • Discussion
    ● Number of classes is set a priori (can be extended by adding a new node if all of the existing ones produce low activation)
    ● Learning rate is a parameter (if too large, priority is given to early instances, and the algorithm may get stuck on early opinions)
    ● No measure of goodness for the resulting classification
    ● Easily extended to arbitrary inputs (use a metric for Net)

  • Terminology
    ● Item: an individual element (e.g., an attribute-value pair or a product)
    ● Itemset: a set of items
    ● Transaction: the set of items occurring together in one record

  • Association Rules
    ● Let U be a set of items and let X, Y ⊆ U, with X ∩ Y = ∅
    ● An association rule is an expression of the form X ⇒ Y, whose meaning is:
      – If the elements of X occur in some context, then so do the elements of Y

  • Quality Measures
    ● The following statistical quantities are relevant to association rule mining:
      – support(X): Pr(X) = count(X) / T
      – support(X ⇒ Y) (or support(X ∪ Y)): Pr(X ∪ Y) = count(X ∪ Y) / T
      – confidence(X ⇒ Y): Pr(Y | X) = count(X ∪ Y) / count(X)
      – lift(X ⇒ Y): confidence(X ⇒ Y) / support(Y)
    ● count(X): number of transactions where all elements of X appear
    ● T: total number of transactions considered

  • Learning Associations
    ● The purpose of association rule learning is to find "interesting" rules, i.e., rules that meet the following two user-defined conditions:
      – support(X ⇒ Y) ≥ MinSupport
      – confidence(X ⇒ Y) ≥ MinConfidence

  • Itemsets

    ● Frequent itemset– An itemset whose support is greater than

    some user-specified minimum support (denoted Lk where k is the size of the itemset)

    ● Candidate itemset– A potentially frequent itemset (denoted Ck

    where k is the size of the itemset)

  • Basic Idea

    ● Generate all frequent itemsets satisfying the condition on minimum support

    ● Build all possible rules from these itemsets and check them against the condition on minimum confidence

    ● All the rules above the minimum confidence threshold are returned for further evaluation

  • AprioriAll Algorithm
    ● L1 ← ∅
    ● For each item Ij ∈ I
      – count({Ij}) = |{Ti : Ij ∈ Ti}|
      – If count({Ij}) ≥ MinSupport × m
        ● L1 ← L1 ∪ {({Ij}, count({Ij}))}
    ● k ← 2
    ● While Lk-1 ≠ ∅
      – Lk ← ∅
      – For each (l1, count(l1)) ∈ Lk-1
        ● For each (l2, count(l2)) ∈ Lk-1
          – If (l1 = {j1, …, jk-2, x} and l2 = {j1, …, jk-2, y} and x ≠ y)
            ● l ← {j1, …, jk-2, x, y}
            ● count(l) ← |{Ti : l ⊆ Ti}|
            ● If count(l) ≥ MinSupport × m
              – Lk ← Lk ∪ {(l, count(l))}
      – k ← k + 1
    ● Return L1 ∪ L2 ∪ … ∪ Lk-1
    (m: total number of transactions)

  • Illustration: Loan Applications

    Client #  Credit History  Debt Level  Collateral  Income Level  Risk Level
    1         Bad             High        None        Low           High
    2         Unknown         High        None        Medium        High
    3         Unknown         Low         None        Medium        Moderate
    4         Unknown         Low         None        Low           High
    5         Unknown         Low         None        High          Low
    6         Unknown         Low         Adequate    High          Low
    7         Bad             Low         None        Low           High
    8         Bad             Low         Adequate    High          Moderate
    9         Good            Low         None        High          Low
    10        Good            High        Adequate    High          Low
    11        Good            High        None        Low           High
    12        Good            High        None        Medium        Moderate
    13        Good            High        None        High          Low
    14        Bad             High        None        Medium        High

  • Running Apriori (I)

    ● Items:– (CH=Bad, .29) (CH=Unknown, .36) (CH=Good, .36)– (DL=Low, .5) (DL=High, .5)– (C=None, .79) (C=Adequate, .21)– (IL=Low, .29) (IL=Medium, .29) (IL=High, .43)– (RL=High, .43) (RL=Moderate, .21) (RL=Low, .36)

    ● Choose MinSupport=.4 and MinConfidence=.8

  • Running Apriori (II)

    ● L1 = {(DL=Low, .5); (DL=High, .5); (C=None, .79); (IL=High, .43); (RL=High, .43)}

    ● L2 = {(DL=High + C=None, .43)}

    ● L3 = {}

  • Running Apriori (III)
    ● Two possible rules:
      – DL = High ⇒ C = None (A)
      – C = None ⇒ DL = High (B)
    ● Confidences:
      – Conf(A) = .86 ≥ MinConfidence → retain
      – Conf(B) = .54 < MinConfidence → ignore

  • Discussion
    ● A "true" data mining algorithm
    ● Despite its popularity, real reported applications are few
    ● Easy to implement with a sparse matrix and simple sums
    ● Computationally expensive (actual run-time depends on MinSupport and, in the worst case, the time complexity is O(2^n))
    ● Not strictly an association learner (it induces rules, which are inherently unidirectional; alternatives exist, e.g., GRI)
    ● Extendible to sequence learning