Introduction to Data Mining Modelling Algorithms
Aim and Scope
● Brief introduction to main concepts and modelling algorithms for DM applications
● Enough background to use existing commercial/freeware tools on simple tasks
● Not exhaustive: only representatives of each class of models are presented
Data Mining Approaches
● Predictive Modeling
– Decision trees: ID3
– Instance-based: k-NN
– Probabilistic: NB, Linear/Logistic Regression
– Connectionist: Backpropagation
● Descriptive Modeling
– Summarization
– Clustering: k-Means, COBWEB, Competitive Learning
– Association Rules: Apriori
● Genetic Algorithms
Decision Tree
● Structure:
– Internal nodes: tests on some property
– Branches: values of the associated property
– Leaf nodes: classifications
● Classification: traverse the tree from the root to a leaf
● Goal: construct a decision tree that allows the classification of objects from examples
– Occam's Razor: prefer the simplest tree consistent with the examples
ID3 Algorithm
● Function ID3(Example-set, Properties)
– If all elements in Example-set are in the same class, then return a leaf node labeled with that class
– Else if Properties is empty, then return a leaf node labeled with the majority class in Example-set
– Else
● Select P from Properties so as to maximize information gain
● Remove P from Properties
● Make P the root of the current tree
● For each value V of P
– Create a branch of the current tree labeled by V
– Partition_V ← Elements of Example-set with value V for P
– ID3(Partition_V, Properties)
– Attach result to branch V
Information Gain
● Entropy:
– Let S be a set of examples from c classes

Entropy(S) = Σi=1..c −pi log2 pi

where pi is the proportion of examples of S belonging to class i (note: we define 0·log 0 = 0)
● The smaller the entropy, the “purer” the partition
● Information Gain:
– Let P be a property with n outcomes
– Partitioning S based on P results in a gain of:

Gain(S, P) = Entropy(S) − Σi=1..n (|Si|/|S|) Entropy(Si)

where Si is the subset of S for which property P has its ith value
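The two quantities above can be sketched in Python as follows (a minimal sketch; the function and attribute names are illustrative, not from the slides). Run on the loan-application training set that follows, Gain(S, Income) ≈ 0.97, which is consistent with ID3 choosing Income as the root in the worked example.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels: sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain(examples, labels, prop):
    """Information gain of partitioning the examples on property `prop`.
    examples: list of dicts mapping property name -> value."""
    n = len(examples)
    total = entropy(labels)
    for value in set(ex[prop] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[prop] == value]
        total -= len(subset) / n * entropy(subset)
    return total
```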
Illustrative Training Set: Risk Assessment for Loan Applications

Client # | Credit History | Debt Level | Collateral | Income Level | RISK LEVEL
1  | Bad     | High | None     | Low    | HIGH
2  | Unknown | High | None     | Medium | HIGH
3  | Unknown | Low  | None     | Medium | MODERATE
4  | Unknown | Low  | None     | Low    | HIGH
5  | Unknown | Low  | None     | High   | LOW
6  | Unknown | Low  | Adequate | High   | LOW
7  | Bad     | Low  | None     | Low    | HIGH
8  | Bad     | Low  | Adequate | High   | MODERATE
9  | Good    | Low  | None     | High   | LOW
10 | Good    | High | Adequate | High   | LOW
11 | Good    | High | None     | Low    | HIGH
12 | Good    | High | None     | Medium | MODERATE
13 | Good    | High | None     | High   | LOW
14 | Bad     | High | None     | Medium | HIGH
ID3 Example (I)
1) Choose Income as root of tree. Partition on Income:
– Low → {1, 4, 7, 11} (step 2)
– Medium → {2, 3, 12, 14} (step 3)
– High → {5, 6, 8, 9, 10, 13} (step 4)
2) All examples are in the same class, HIGH. Return leaf node:
– Income = Low → HIGH
3) Choose Debt Level as root of the subtree for {2, 3, 12, 14}:
– Low → {3} (step 3a)
– High → {2, 12, 14} (step 3b)
3a) All examples are in the same class, MODERATE. Return leaf node:
– Debt Level = Low → MODERATE
ID3 Example (II)
3b) Choose Credit History as root of the subtree for {2, 12, 14}:
– Unknown → {2} (step 3b')
– Bad → {14} (step 3b'')
– Good → {12} (step 3b''')
3b'-3b''') All examples are in the same class. Return leaf nodes:
– Unknown → HIGH, Bad → HIGH, Good → MODERATE
4) Choose Credit History as root of the subtree for {5, 6, 8, 9, 10, 13}:
– Unknown → {5, 6} (step 4a)
– Bad → {8} (step 4b)
– Good → {9, 10, 13} (step 4c)
4a-4c) All examples are in the same class. Return leaf nodes:
– Unknown → LOW, Bad → MODERATE, Good → LOW
ID3 Example (III)
Attach subtrees at appropriate places. Final tree:
– Income = Low → HIGH
– Income = Medium → Debt Level
● Debt Level = Low → MODERATE
● Debt Level = High → Credit History (Unknown → HIGH, Bad → HIGH, Good → MODERATE)
– Income = High → Credit History (Unknown → LOW, Bad → MODERATE, Good → LOW)
Discussion
● ID3 handles only discrete attributes: extensions to numerical attributes have been proposed, the most famous being C4.5/C5.0
● Experience shows that decision trees tend to produce very good results on many problems
● Trees are most attractive when end users want interpretable knowledge from their data
● Overfitting likely: use pruning
Instance-based
● Does not build an explicit model
● Training data is stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify the new query instance
● Lazy learning: simply compute the classification of each new query instance as needed
k-NN Algorithm
● For each training instance t = (x, f(x))
– Add t to the set Tr_instances
● Given a query instance q to be classified
– Let x1, …, xk be the k instances in Tr_instances nearest to q
– Return

f(q) = argmax(v∈V) Σi=1..k δ(v, f(xi))

where V is the finite set of target class values, and δ(a,b) = 1 if a = b, and 0 otherwise
● Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors
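A minimal Python sketch of this procedure (Euclidean distance; function and variable names are illustrative):

```python
from collections import Counter
import math

def knn_classify(tr_instances, q, k):
    """Classify query q by majority vote among its k nearest training instances.
    tr_instances: list of (x, label) pairs, where x is a tuple of numbers."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(tr_instances, key=lambda t: dist(t[0], q))[:k]
    votes = Counter(label for _, label in nearest)  # delta-sum = vote count per class
    return votes.most_common(1)[0][0]
```

As in the illustration that follows, the same query can flip class as k grows: one + point nearby, four − points a bit farther away gives + under 1-NN but − under 5-NN.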
Illustration
[Figure: a query point q surrounded by + and − training points]
q is + under 1-NN, but − under 5-NN
Discussion
● Works well on many practical problems and fairly noise tolerant (depending on k)
● Some issues:
– Irrelevant attributes dominate the distance function: use a weighted distance function or remove irrelevant attributes
– Long query processing: implement indexing schemes
– Large memory requirement: implement instance reduction strategies
– Non-Euclidean/heterogeneous spaces: extend/design adequate distance measures
Probabilistic Learning
● We are often interested in determining the best hypothesis from some space H, given the observed training data D.
● One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.
Remark
● Instead of finding the most probable hypothesis given the training data, often, it is the following related question that is most relevant:
– Which is the most probable classification of the new query instance given the training data?
● In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
Bayes Optimal Classification
● If the possible classification of the new instance can take on any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is given by:

P(vj|D) = Σ(hi∈H) P(vj|hi) P(hi|D)

● Then, the optimal classification of the new instance is:

argmax(vj∈V) Σ(hi∈H) P(vj|hi) P(hi|D)

● Nice, but impractical for large hypothesis spaces
Naïve Bayes Learning
● A practical Bayesian learning method
● Applies when instances are conjunctions of attribute values and the target takes its values from some finite set V
● Consists in assigning to a new query instance the most probable target value, vMAP, given the attribute values a1, …, an that describe the instance, i.e.,

vMAP = argmax(vj∈V) P(vj|a1, …, an)
NB Algorithm
● Using Bayes theorem, this can be reformulated as:

vMAP = argmax(vj∈V) P(a1, …, an|vj) P(vj) / P(a1, …, an)

● Making the further simplifying assumption that attribute values are conditionally independent given the target value, one can write the conjunctive conditional probability as a product of simple conditional probabilities, producing the algorithm:

Return argmax(vj∈V) P(vj) Πi=1..n P(ai|vj)
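The algorithm can be sketched as follows (plain frequency estimates, no smoothing; function and attribute names are illustrative). Applied to the loan data on the next slide with the query (Bad, Low, Adequate, Medium), it returns MODERATE with a score of about 1.06%, matching the worked example.

```python
def nb_classify(examples, labels, query):
    """Return (best_class, score), where score = P(v) * prod_i P(a_i|v),
    with probabilities estimated by raw frequencies (no m-estimate)."""
    n = len(labels)
    best, best_score = None, -1.0
    for v in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == v]
        score = len(idx) / n                      # prior P(v)
        for attr, val in query.items():           # times each P(a_i|v)
            score *= sum(1 for i in idx if examples[i][attr] == val) / len(idx)
        if score > best_score:
            best, best_score = v, score
    return best, best_score
```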
Illustrative Training Set: Risk Assessment for Loan Applications

Client # | Credit History | Debt Level | Collateral | Income Level | RISK LEVEL
1  | Bad     | High | None     | Low    | HIGH
2  | Unknown | High | None     | Medium | HIGH
3  | Unknown | Low  | None     | Medium | MODERATE
4  | Unknown | Low  | None     | Low    | HIGH
5  | Unknown | Low  | None     | High   | LOW
6  | Unknown | Low  | Adequate | High   | LOW
7  | Bad     | Low  | None     | Low    | HIGH
8  | Bad     | Low  | Adequate | High   | MODERATE
9  | Good    | Low  | None     | High   | LOW
10 | Good    | High | Adequate | High   | LOW
11 | Good    | High | None     | Low    | HIGH
12 | Good    | High | None     | Medium | MODERATE
13 | Good    | High | None     | High   | LOW
14 | Bad     | High | None     | Medium | HIGH
NB Example
Class priors: P(High) = 0.43, P(Moderate) = 0.21, P(Low) = 0.36 (6, 3 and 5 of the 14 examples)

Conditional probabilities P(value | Risk Level):
                 High  Moderate  Low
Credit History
  Unknown        0.33  0.33      0.40
  Bad            0.50  0.33      0.00
  Good           0.17  0.33      0.60
Debt Level
  High           0.67  0.33      0.40
  Low            0.33  0.67      0.60
Collateral
  None           1.00  0.67      0.60
  Adequate       0.00  0.33      0.40
Income Level
  High           0.00  0.33      1.00
  Medium         0.33  0.67      0.00
  Low            0.67  0.00      0.00

Consider the query instance (Bad, Low, Adequate, Medium):
  High 0.00%, Moderate 1.06%, Low 0.00% → Prediction: Moderate

Consider the query instance (Bad, High, None, Low), seen in training:
  High 9.52%, Moderate 0.00%, Low 0.00% → Prediction: High
Discussion
● Whenever the assumption of conditional independence is satisfied, NB classification is optimal
● NB is inherently incremental
● NB estimates probabilities by frequencies
– Assume P(X=x|Y=y) = 0.05 and the training set is such that ny = 5. Then, it is highly probable that nx|y = 0. The fraction is thus an underestimate of the probability, and since the product of conditional probabilities becomes 0, it will “dominate” the NB classifier for all new queries with X=x.
– Use the m-estimate: replace nx|y/ny by (nx|y + mp)/(ny + m), where p is our prior estimate of the probability we wish to determine and m is a constant (typically, p = 1/(number of values of X); m acts as a weight, similar to adding m virtual instances distributed according to p)
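The m-estimate above can be sketched in one line (illustrative function name; with m = 0 it reduces to the raw frequency):

```python
def m_estimate(n_xy, n_y, p, m):
    """Smoothed estimate of P(X=x|Y=y): (n_xy + m*p) / (n_y + m),
    i.e., the raw frequency plus m virtual instances distributed according to prior p."""
    return (n_xy + m * p) / (n_y + m)
```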
Linear Regression
● Fit a linear model to data where the dependent variable is continuous:

Y = β0 + β1X1 + β2X2 + … + βnXn + e

● Given a set of points (Xi, Yi), we wish to find a linear function (or line in 2 dimensions) that “goes through” these points.
● In general, the points are not exactly aligned:
– Find the line that best fits the points
Sum-squared Error (SSE)
SSE = Σy (yobserved − ypredicted)²
TSS = Σy (yobserved − ȳobserved)²
R² = 1 − SSE/TSS

● The smaller the SSE, the better the fit
● Hence, linear regression attempts to minimize SSE, or equivalently to maximize R²
Analytical Solution
● In 2 dimensions:

β1 = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
β0 = (Σy − β1Σx) / n

● In n > 2 dimensions: matrix arithmetic
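A sketch of the 2-dimensional closed-form solution (illustrative names). On the seven (x, y) points of the illustration that follows it yields β1 ≈ 1.94 and β0 ≈ 1.61.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (beta0, beta1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of xy
    sxx = sum(x * x for x in xs)               # sum of x^2
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = (sy - b1 * sx) / n
    return b0, b1
```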
Illustration
(target: y = 2x + 1.5)

x     y (obs)  y (pred)  SSE    TSS
1.20  4.00     3.94      0.004  18.940
2.30  5.60     6.07      0.223  4.920
3.10  7.90     7.62      0.076  0.444
3.40  8.00     8.21      0.042  0.007
4.00  10.10    9.37      0.533  1.166
4.60  10.40    10.53     0.018  5.036
5.50  12.00    12.28     0.078  15.920
Total                    0.974  46.432

β1 = 1.94, β0 = 1.61, R² = 0.98
Logistic Regression
● Fit a curve to data in which the dependent variable is binary, or dichotomous
● Regression curve: a sigmoid function! (bounded by asymptotes y=0 and y=1)
● [Figure: observed data with fitted sigmoid curve] For each value of SurvRate, the number of dots is the number of patients with that value of NewOut
Solution
● Given some event with probability p of being 1, the odds of that event are given by:

odds = p / (1−p)

● The logit is the natural log of the odds:

logit(p) = ln(odds) = ln(p/(1−p))

● In logistic regression, we seek a model (2 dim. here):

logit(p) = β0 + β1X

● That is, the logit is assumed to be linearly related to the independent variables, and we solve an ordinary (linear) regression!
Illustration (I)

Age Group   Coronary Heart Disease   Total
            No    Yes
1 (20-29)   9     1                  10
2 (30-34)   13    2                  15
3 (35-39)   9     3                  12
4 (40-44)   10    5                  15
5 (45-49)   7     6                  13
6 (50-54)   3     5                  8
7 (55-59)   4     13                 17
8 (60-69)   2     8                  10
Total       57    43                 100
Illustration (II)

Age Group  p(CHD=1)  odds    log odds  #occ
1          0.1000    0.1111  -2.1972   10
2          0.1333    0.1538  -1.8718   15
3          0.2500    0.3333  -1.0986   12
4          0.3333    0.5000  -0.6931   15
5          0.4615    0.8571  -0.1542   13
6          0.6250    1.6667   0.5108   8
7          0.7647    3.2500   1.1787   17
8          0.8000    4.0000   1.3863   10

Illustration (III)

X (AG)  Y (log odds)  X^2        XY        #occ
1       -2.1972       1.0000     -2.1972   10
2       -1.8718       4.0000     -3.7436   15
3       -1.0986       9.0000     -3.2958   12
4       -0.6931       16.0000    -2.7726   15
5       -0.1542       25.0000    -0.7708   13
6        0.5108       36.0000     3.0650   8
7        1.1787       49.0000     8.2506   17
8        1.3863       64.0000    11.0904   10
Sum      448          -37.6471   2504.0000  106.3981  100

Note: the sums reflect the number of occurrences (Sum(X) = X1·#occ(X1) + … + X8·#occ(X8), etc.)
Illustration (IV)
● Results from regression:
– β0 = -2.856 and β1 = 0.5535

Age Group  p(CHD=1)  est. p
1          0.1000    0.0909
2          0.1333    0.1482
3          0.2500    0.2323
4          0.3333    0.3448
5          0.4615    0.4778
6          0.6250    0.6142
7          0.7647    0.7346
8          0.8000    0.8280

SSE = 0.0028, TSS = 0.5265, R² = 0.9946
Recovering Probabilities
ln(p/(1−p)) = β0 + β1X

p/(1−p) = e^(β0+β1X)

p = e^(β0+β1X) / (1 + e^(β0+β1X)) = 1 / (1 + e^(−(β0+β1X)))

which gives p as a sigmoid function!
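A sketch of recovering p from the fitted logit (illustrative function name). With the β0 = -2.856 and β1 = 0.5535 obtained earlier, age group 1 gives p ≈ 0.0909, matching the est. p column above.

```python
import math

def logistic_p(x, b0, b1):
    """Recover p from the fitted logit: p = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
```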
Discussion
● Regression is a powerful data mining technique
– It provides prediction
– It offers insight into the relative power of each independent variable
● Technique of choice in medicine and the social sciences
Multi-layer Feed-forward NN
[Figure: a fully connected feed-forward network with input units i, hidden units j, and output units k]
Training Error
● We define the training error of a hypothesis, or weight vector, over a set of data D, by:

E(w) = ½ Σ(d∈D) (td − od)²

which we will seek to minimize
Non-linear Activation
● Introduce non-linearity with the sigmoid function:

net = Σi=1..n wi xi
out = 1 / (1 + e^(−net))
d out / d net = out·(1 − out)

● Differentiable (required for gradient descent)
● Most unstable in the middle
Backpropagation
● Initialize all weights randomly
● Repeat
– Present a training instance
– Compute error δk of output units
– For each hidden layer
● Compute error δj using error from next layer
– Update all weights: wij ← wij + Δwij (with Δwij = ηOiδj)
● Until (E < CriticalError)
Network Equations
Δwij = ηOiδj

where
δj = (Tj − Oj) f'(netj)        for output units
δj = (Σk δk wjk) f'(netj)      for hidden units
f'(netj) = Oj(1 − Oj)
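As a sketch, these equations can be applied by hand to the tiny network of the illustration that follows (3 inputs, one hidden node h, outputs q and r, all weights 0.2, η = 0.5); the values reproduce the first row of the trace table:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# First training instance: inputs (1, 0, 1), targets (q, r) = (0, 1)
w_in = [0.2, 0.2, 0.2]          # weights a-h, b-h, c-h
w_hq, w_hr = 0.2, 0.2           # weights h-q, h-r
x, (t_q, t_r) = [1, 0, 1], (0, 1)

h = sigmoid(sum(w * xi for w, xi in zip(w_in, x)))   # hidden activation, about 0.6
o_q, o_r = sigmoid(w_hq * h), sigmoid(w_hr * h)      # outputs, both about 0.53

# Output deltas: (T - O) * f'(net), with f'(net) = O * (1 - O)
d_q = (t_q - o_q) * o_q * (1 - o_q)
d_r = (t_r - o_r) * o_r * (1 - o_r)

# Hidden delta: (sum_k delta_k * w_jk) * f'(net_j)
d_h = (d_q * w_hq + d_r * w_hr) * h * (1 - h)
```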
Illustration (I)
● Consider a simple network composed of:
– 3 inputs: a, b, c
– 1 hidden node: h
– 2 outputs: q, r
● Assume η = 0.5, all weights are initialized to 0.2 and weight updates are incremental
● Consider the training set (inputs a b c → targets q r):
– 1 0 1 → 0 1
– 0 1 1 → 1 1
● 4 iterations over the training set
Illustration (II)
Row format: a b c | Wa-h Wb-h Wc-h | h | Wh-q Wh-r | Out q, Out r | Target q, Target r | δq, ΔWh-q, δr, ΔWh-r, δh, ΔWa-h

initialization: Wa-h=0.2 Wb-h=0.2 Wc-h=0.2, Wh-q=0.2 Wh-r=0.2
1 0 1 | 0.2 0.2 0.2 | 0.6 | 0.2 0.2 | 0.53, 0.53 | 0, 1 | -0.13, -0.04, 0.12, 0.04, 0, 0
update weights: 0.2 0.2 0.2, 0.16 0.24
0 1 1 | 0.2 0.2 0.2 | 0.6 | 0.16 0.24 | 0.52, 0.54 | 1, 1 | 0.12, 0.04, 0.12, 0.03, 0.01, 0
update weights: 0.2 0.21 0.21, 0.2 0.27
1 0 1 | 0.2 0.21 0.21 | 0.6 | 0.2 0.27 | 0.53, 0.54 | 0, 1 | -0.13, -0.04, 0.11, 0.03, 0, 0
update weights: 0.2 0.21 0.21, 0.16 0.3
0 1 1 | 0.2 0.21 0.21 | 0.6 | 0.16 0.3 | 0.52, 0.55 | 1, 1 | 0.12, 0.04, 0.11, 0.03, 0.01, 0
update weights: 0.2 0.21 0.21, 0.19 0.34
1 0 1 | 0.2 0.21 0.21 | 0.6 | 0.19 0.34 | 0.53, 0.55 | 0, 1 | -0.13, -0.04, 0.11, 0.03, 0, 0
update weights: 0.2 0.21 0.21, 0.15 0.37
0 1 1 | 0.2 0.21 0.21 | 0.6 | 0.15 0.37 | 0.52, 0.56 | 1, 1 | 0.12, 0.04, 0.11, 0.03, 0.01, 0
update weights: 0.2 0.22 0.22, 0.19 0.4
1 0 1 | 0.2 0.22 0.22 | 0.6 | 0.19 0.4 | 0.53, 0.56 | 0, 1 | -0.13, -0.04, 0.11, 0.03, 0, 0
update weights: 0.2 0.22 0.22, 0.15 0.44
0 1 1 | 0.2 0.22 0.22 | 0.61 | 0.15 0.44 | 0.52, 0.57 | 1, 1 | 0.12, 0.04, 0.11, 0.03, 0.02, 0
update weights: 0.2 0.23 0.23, 0.19 0.47
Discussion
● 3-layer BPNNs are universal function approximators
● Require many passes over the data
● Potential for massive parallelism
● Convergence to the global minimum is not guaranteed
– Use a momentum term:

Δwij(n) = ηOiδj + αΔwij(n−1)

● Keep moving through small local (global!) minima or along flat regions
– Use the incremental/stochastic version of the algorithm
– Train multiple networks with different starting weights
● Select the best on a hold-out validation set
● Combine outputs (e.g., weighted average)
Check Your Understanding● young,myope,no,reduced,none● young,myope,no,normal,soft● young,myope,yes,reduced,none● young,myope,yes,normal,hard● young,hypermetrope,no,reduced,none● young,hypermetrope,no,normal,soft● young,hypermetrope,yes,reduced,none● young,hypermetrope,yes,normal,hard● pre-presbyopic,myope,no,reduced,none● pre-presbyopic,myope,no,normal,soft● pre-presbyopic,myope,yes,reduced,none● pre-presbyopic,myope,yes,normal,hard● pre-presbyopic,hypermetrope,no,reduced,none● pre-presbyopic,hypermetrope,no,normal,soft● pre-presbyopic,hypermetrope,yes,reduced,none● pre-presbyopic,hypermetrope,yes,normal,none● presbyopic,myope,no,reduced,none● presbyopic,myope,no,normal,none● presbyopic,myope,yes,reduced,none● presbyopic,myope,yes,normal,hard● presbyopic,hypermetrope,no,reduced,none● presbyopic,hypermetrope,no,normal,soft● presbyopic,hypermetrope,yes,reduced,none● presbyopic,hypermetrope,yes,normal,none
Attribute Information:
1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
2. spectacle prescription: (1) myope, (2) hypermetrope
3. astigmatic: (1) no, (2) yes
4. tear production rate: (1) reduced, (2) normal
Class Distribution:
1. hard contact lenses: 4
2. soft contact lenses: 5
3. no contact lenses: 15

Apply all of the above predictive modelling algorithms to this dataset
Summarization
● A compact description for a subset of the data
● Retrospective analysis
● Techniques:
– Statistics
– Information theory
– Online analytical processing (OLAP)
Illustration
● Average down time of all plant equipment in a given month
● Total income generated by each sales representative per region per year
● Proportion of each type of surgical procedure performed by gender and ethnicity
Estimating Means
● What is the maximum likelihood hypothesis for the mean of a single Normal distribution given observed instances from it?
– The hypothesis that minimizes the SSE, which in this case happens to be the sample mean:

µML = argmin(µ) Σi=1..n (xi − µ)² = (1/n) Σi=1..n xi

● What if there are k means to estimate?
Using Hidden Variables
● We have k hidden variables
● Each training instance is extended from xi to <xi, zi1, zi2, …, zik>:
– xi is the observed instance
– zij are the hidden variables
– zij = 1 if xi was generated by the jth Gaussian and 0 otherwise
k-Means Algorithm
● Initialization:
– Set h = <µ1, …, µk>, where the µi's are arbitrary values
● Step 1:
– Calculate C[zij] for each hidden variable zij, assuming the current hypothesis h: 1 if xi is closest to µj, 0 otherwise
● Step 2:
– Calculate a new maximum likelihood hypothesis h' = <µ1', …, µk'>, assuming the value taken on by each zij is C[zij] as calculated in Step 1: each µj' is the mean of the xi's for which C[zij] = 1
– Replace h by h'
– If stopping condition is not met, go to Step 1
Intuitively
1. Pick a number (k) of cluster centers (at random)
2. Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3. Move each cluster center to the mean of its assigned items
4. Repeat steps 2 and 3 until convergence (i.e., change in cluster assignments less than a threshold)
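The steps above can be sketched as follows (the function name and the optional `seeds` parameter, used here instead of random initialization for reproducibility, are illustrative):

```python
import random

def k_means(points, k, seeds=None, max_iter=100):
    """Plain k-means on points given as tuples of numbers.
    Returns (centers, assignments); stops when assignments no longer change."""
    centers = list(seeds) if seeds else random.sample(points, k)
    assign = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (squared Euclidean distance)
        new_assign = [min(range(k),
                          key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
                      for pt in points]
        if new_assign == assign:   # convergence: no change in cluster assignments
            break
        assign = new_assign
        # Step 3: move each center to the mean of its assigned points
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, assign
```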
Illustration (I)
[Figure: points in the X-Y plane] Pick 3 initial cluster centers k1, k2, k3 (randomly)
Illustration (II)
[Figure] Assign each point to the closest cluster center
Illustration (III)
[Figure] Move each cluster center to the mean of each cluster
Illustration (IV)
[Figure] Reassign points closest to a different new cluster center; three points have been “re-labelled”
Illustration (V)
[Figure] Re-compute cluster means
Illustration (VI)
[Figure] Move cluster centers to cluster means; no more changes: Done!
Discussion
● Simple
● Items automatically assigned to clusters
● Requires some distance function
● Must pick the number of clusters beforehand
● Result can vary significantly depending on the initial choice of seeds
– To increase the chances of finding the global optimum: restart with different random seeds
● Sensitive to outliers
COBWEB Overview● Symbolic approach to category formation● Uses global quality metric to determine
number of clusters, depth of hierarchy, and category membership of new instances
● Categories are probabilistic● Algorithm is incremental
Predictability
● P(Fi=v
ij | C
k) is called predictability
– Probability that an object has value vij for feature F
i given that the object belongs to
category Ck
– The greater this probability, the more likely two objects in a category share the same features
Predictiveness
● P(Ck | F
i=v
ij) is called predictiveness
– Probability with which an object belongs to category C
k given that it has value v
ij for
feature Fi
– The greater this probability, the less likely objects not in the category will have those feature values.
Category Utility
● P(Fi=vij) serves as a weight
– Ensures that frequently-occurring feature values exert a stronger influence on the evaluation
● CU maximizes the potential for inferring information while maximizing intra-class similarity and inter-class differences:

CU = Σk Σj Σi P(Fi=vij) P(Fi=vij|Ck) P(Ck|Fi=vij)
Tree Representation
● Each node stores:
– Its probability of occurrence, P(Ck) (= num. instances at node / total num. instances)
– All possible values of every feature observed in the instances, and for each such value, its predictability
– Predictiveness, computed using Bayes rule
● Leaf nodes correspond to observed instances
● All links are “is-a” links
COBWEB Algorithm
● Function Cobweb(Node, Inst)
– If Node is a leaf
● Create 2 children, L1 and L2, of Node
● Set probabilities of L1 to those of Node
● Initialize probabilities of L2 to those of Inst
● Add Inst to Node, updating Node's probabilities
– Else
● Add Inst to Node, updating Node's probabilities
● For each child C of Node, compute CU of the taxonomy obtained by placing Inst in C
● Compute
– S1 = score of best categorization C1
– S2 = score of next best categorization C2
– S3 = score of placing Inst in a new category
– S4 = score of merging C1 and C2
– S5 = score of splitting C1
● Case:
– S1 is best score: Cobweb(C1, Inst)
– S3 is best score: Initialize the new category's probabilities to those of Inst
– S4 is best score: Let Cm be the result of merging C1 and C2; Cobweb(Cm, Inst)
– S5 is best score: Split C1; Cobweb(Node, Inst)
– Default: Cobweb(C2, Inst)
Illustration (I)
● Assume 2 attributes only
– Color: Blue, Red
– Shape: Triangle, Square
● Examples:
– Blue Triangle
– Blue Square
– Red Triangle
– Blue Triangle
Illustration (II)
Illustration (III)
Illustration (IV)
Illustration (V)
● Merge C1 and C2
– Same as having only the root node: CU = 10/9
● Split C1
– Same as creating a new category: CU = 2
● Hence:
– S1 = 5/3
– S2 = 4/3
– S3 = 2
– S4 = 10/9
– S5 = 2
Illustration (VI)
● Highest score is S3
– Create a new category
● Repeat the process for the next example, Blue Triangle
– Results:
● S1 = 2
● S2 = 7/4
● S3 = 2
● S4 = 7/6
● S5 = 2
– Highest score is S1 (as expected)
Illustration (VII)
● Emerging clustering:
– C1: Blue Triangle
– C2: Blue Square
– C3: Red Triangle
● Seems reasonable (?)
http://www-ai.cs.uni-dortmund.de/kdnet/auto?self=$81d91eaae317b2bebb
Discussion
● Nice probabilistic model with no parameters set a priori
● Only handles nominal features (CLASSIT extends to numerical)
● Sensitive to the order of presentation of instances
● Retains each instance, which may cause problems with noisy data
CL Overview
● Sub-symbolic approach
● Weight vectors serve as prototypes
– The more similar the input vector is to the weight vector, the higher the net is
– Weights “move about” until they sit in the “middle” of a cluster
● Simple model: binary inputs and single-layer network
– Output nodes implement a winner-take-all policy
– Lateral inhibition enforced between pairs of outputs
– Every input is connected to every output with an adjustable, real-valued weight (sum of weights for each unit = 1)
CL Algorithm
● Let
– n: number of inputs
– xi: value (0 or 1) of input i
– netj = Σi=1..n wij xi
– a: number of inputs set to 1
– g: learning rate
● Initialize weights randomly
● For each training instance
– Δwij = g(xi/a − wij) if j = argmax(k) netk, and 0 otherwise
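One competitive-learning step can be sketched as follows (illustrative function name; `weights` is updated in place). On the initial weights of the illustration that follows, presenting 1100 makes C1 win and moves its weights from .1 .3 .1 .2 to .26 .38 .06 .12.

```python
def cl_step(weights, x, g):
    """One competitive-learning step: only the winning output's weights move,
    each toward x_i/a. weights: list of weight vectors (one per output);
    x: binary input vector; g: learning rate. Returns the winner's index."""
    a = sum(x)  # number of inputs set to 1
    nets = [sum(w * xi for w, xi in zip(wv, x)) for wv in weights]
    winner = max(range(len(weights)), key=lambda j: nets[j])
    weights[winner] = [w + g * (xi / a - w) for w, xi in zip(weights[winner], x)]
    return winner
```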
Illustration (I)
● Assume:
– 4 Boolean inputs
– 2 clusters, with initial weight vectors:
● C1: .1 .3 .1 .2
● C2: .2 .1 .4 .1
(note: weights do not sum to 1 here)
– g = 0.4
● Training examples
– 1100
– 0011
– 1110
– 1001
Illustration (II)
● 1100
– C1: 1x0.1+1x0.3+0x0.1+0x0.2 = 0.4
– C2: 1x0.2+1x0.1+0x0.4+0x0.1 = 0.3
– C1 wins; updated weights:
● C1: .26 .38 .06 .12
● C2: .2 .1 .4 .1
● 0011
– C1: 0x0.26+0x0.38+1x0.06+1x0.12 = 0.18
– C2: 0x0.2+0x0.1+1x0.4+1x0.1 = 0.5
– C2 wins; updated weights:
● C1: .26 .38 .06 .12
● C2: .12 .06 .44 .26
Illustration (III)
● 1110
– C1: 1x.26+1x.38+1x.06+0x.12 = 0.7
– C2: 1x.12+1x.06+1x.44+0x.26 = 0.62
– C1 wins; updated weights:
● C1: .29 .36 .17 .07
● C2: .12 .06 .44 .26
● 1001
– C1: 1x.29+0x.36+0x.17+1x.07 = 0.36
– C2: 1x.12+0x.06+0x.44+1x.26 = 0.38
– C2 wins; updated weights:
● C1: .29 .36 .17 .07
● C2: .27 .04 .26 .36
Illustration (IV)
● 1100
– C1: 1x0.29+1x0.36+0x0.17+0x0.07 = 0.65
– C2: 1x0.27+1x0.04+0x0.26+0x0.36 = 0.31
– C1 wins; updated weights:
● C1: .37 .42 .1 .04
● C2: .27 .04 .26 .36
● 0011
– C1: 0x0.37+0x0.42+1x0.1+1x0.04 = 0.14
– C2: 0x0.27+0x0.04+1x0.26+1x0.36 = 0.62
– C2 wins; updated weights:
● C1: .37 .42 .1 .04
● C2: .16 .02 .36 .42
Illustration (V)
● 1110
– C1: 1x.37+1x.42+1x.1+0x.04 = 0.89
– C2: 1x.16+1x.02+1x.36+0x.42 = 0.54
– C1 wins; updated weights:
● C1: .36 .39 .19 .02
● C2: .16 .02 .36 .42
● 1001
– C1: 1x.36+0x.39+0x.19+1x.02 = 0.38
– C2: 1x.16+0x.02+0x.36+1x.42 = 0.58
– C2 wins; updated weights:
● C1: .36 .39 .19 .02
● C2: .23 .15 .35 .25
Illustration (VI)
● 1100
– C1: 1x0.36+1x0.39+0x0.19+0x0.02 = 0.75
– C2: 1x0.23+1x0.15+0x0.35+0x0.25 = 0.38
– C1 wins; updated weights:
● C1: .42 .43 .11 .01
● C2: .23 .15 .35 .25
● 0011
– C1: 0x0.42+0x0.43+1x0.11+1x0.01 = 0.12
– C2: 0x0.23+0x0.15+1x0.35+1x0.25 = 0.6
– C2 wins; updated weights:
● C1: .42 .43 .11 .01
● C2: .14 .09 .41 .35
Illustration (VII)
● Emerging clustering
– C1: 1100, 1110
– C2: 0011, 1001
● Seems reasonable (?)
http://www.psychology.mcmaster.ca/4i03/demos/competitive1-demo.html
Discussion
● Number of classes set a priori (can be extended by adding a new node if all of the existing ones produce low activation)
● Learning rate is a parameter (if too large, then priority goes to early instances; may get stuck on early opinions)
● No measure of goodness for the resulting classification
● Easily extended to arbitrary inputs (use a metric for Net)
Terminology
Item
Itemset
Transaction
Association Rules
● Let U be a set of items and let X, Y ⊆ U, with X ∩ Y = ∅
● An association rule is an expression of the form X ⇒ Y, whose meaning is:
– If the elements of X occur in some context, then so do the elements of Y
Quality Measures
● The following statistical quantities are relevant to association rule mining:
– support(X)
● Pr(X) ≈ count(X) / T
– support(X ⇒ Y) (or support(X ∪ Y))
● Pr(X ∪ Y) = count(X ∪ Y) / T
– confidence(X ⇒ Y)
● Pr(Y | X) = count(X ∪ Y) / count(X)
– lift(X ⇒ Y)
● confidence(X ⇒ Y) / support(Y)

where count(X) is the number of transactions in which all elements of X appear and T is the total number of transactions considered
Learning Associations
● The purpose of association rule learning is to find “interesting” rules, i.e., rules that meet the following two user-defined conditions:
– support(X ⇒ Y) ≥ MinSupport
– confidence(X ⇒ Y) ≥ MinConfidence
Itemsets
● Frequent itemset– An itemset whose support is greater than
some user-specified minimum support (denoted Lk where k is the size of the itemset)
● Candidate itemset– A potentially frequent itemset (denoted Ck
where k is the size of the itemset)
Basic Idea
● Generate all frequent itemsets satisfying the condition on minimum support
● Build all possible rules from these itemsets and check them against the condition on minimum confidence
● All the rules above the minimum confidence threshold are returned for further evaluation
AprioriAll Algorithm
● L1 ← ∅
● For each item Ij ∈ I
– count({Ij}) ← |{Ti : Ij ∈ Ti}|
– If count({Ij}) ≥ MinSupport × m (where m is the number of transactions)
● L1 ← L1 ∪ {({Ij}, count({Ij}))}
● k ← 2
● While Lk-1 ≠ ∅
– Lk ← ∅
– For each (l1, count(l1)) ∈ Lk-1
● For each (l2, count(l2)) ∈ Lk-1
– If (l1 = {j1, …, jk-2, x} ∧ l2 = {j1, …, jk-2, y} ∧ x ≠ y)
● l ← {j1, …, jk-2, x, y}
● count(l) ← |{Ti : l ⊆ Ti}|
● If count(l) ≥ MinSupport × m
– Lk ← Lk ∪ {(l, count(l))}
– k ← k + 1
● Return L1 ∪ L2 ∪ … ∪ Lk-1
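A compact sketch of the same level-wise idea (Python; items represented as "Attribute=Value" strings and transactions as frozensets, which are illustrative choices, not from the slides). Run on the loan table that follows with MinSupport = .4, it recovers, among others, the frequent pair (DL=High, C=None) with support .43.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets, generating size-(k+1)
    candidates by joining size-k frequent itemsets that share k-1 items."""
    m = len(transactions)
    count = lambda iset: sum(1 for t in transactions if iset <= t)
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    while level:
        level = {l for l in level if count(l) >= min_support * m}
        frequent.update({l: count(l) / m for l in level})
        # join step: unions of two k-itemsets that differ in exactly one element
        level = {l1 | l2 for l1, l2 in combinations(level, 2)
                 if len(l1 | l2) == len(l1) + 1}
    return frequent
```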
Illustration: Loan Applications

Client # | Credit History | Debt Level | Collateral | Income Level | Risk Level
1  | Bad     | High | None     | Low    | High
2  | Unknown | High | None     | Medium | High
3  | Unknown | Low  | None     | Medium | Moderate
4  | Unknown | Low  | None     | Low    | High
5  | Unknown | Low  | None     | High   | Low
6  | Unknown | Low  | Adequate | High   | Low
7  | Bad     | Low  | None     | Low    | High
8  | Bad     | Low  | Adequate | High   | Moderate
9  | Good    | Low  | None     | High   | Low
10 | Good    | High | Adequate | High   | Low
11 | Good    | High | None     | Low    | High
12 | Good    | High | None     | Medium | Moderate
13 | Good    | High | None     | High   | Low
14 | Bad     | High | None     | Medium | High
Running Apriori (I)
● Items:– (CH=Bad, .29) (CH=Unknown, .36) (CH=Good, .36)– (DL=Low, .5) (DL=High, .5)– (C=None, .79) (C=Adequate, .21)– (IL=Low, .29) (IL=Medium, .29) (IL=High, .43)– (RL=High, .43) (RL=Moderate, .21) (RL=Low, .36)
● Choose MinSupport=.4 and MinConfidence=.8
Running Apriori (II)
● L1 = {(DL=Low, .5); (DL=High, .5); (C=None, .79); (IL=High, .43); (RL=High, .43)}
● L2 = {(DL=High + C=None, .43)}
● L3 = {}
Running Apriori (III)
● Two possible rules:
– DL=High ⇒ C=None (A)
– C=None ⇒ DL=High (B)
● Confidences:
– Conf(A) = .86 → Retain
– Conf(B) = .54 → Ignore
Discussion
● A “true” data mining algorithm
● Despite its popularity, real reported applications are few
● Easy to implement with a sparse matrix and simple sums
● Computationally expensive (actual run-time depends on MinSupport; in the worst case, the time complexity is O(2^n))
● Not strictly an associations learner (it induces rules, which are inherently unidirectional; alternatives exist, e.g., GRI)
● Extendible for sequence learning