Data Warehousing and Machine Learning
Preprocessing
Thomas D. Nielsen
Aalborg University
Department of Computer Science
Spring 2008
DWML Spring 2008 1 / 35
Preprocessing
Before you can start on the actual data mining, the data may require some preprocessing:
• Attributes may be redundant.
• Values may be missing.
• The data contains outliers.
• The data is not in a suitable format.
• The values appear inconsistent.
Garbage in, garbage out
DWML Spring 2008 2 / 35
Preprocessing

Data Cleaning

ID    Zip     Gender  Income     Age  Marital status  Transaction amount
1001  10048   M       75000      C    M               5000
1002  J2S7K7  F       -40000     40   W               4000
1003  90210   —       10000000   45   S               7000
1004  6269    M       50000      0    S               1000
1005  55101   F       99999      30   D               3000

(— = missing)

Observations:
• Zip J2S7K7: correct zip code? (not a five-digit US code)
• Zip 6269: correct zip code? (only four digits)
• Gender of record 1003: missing value!
• Income -40000: error! (negative income)
• Income 10000000: error/outlier!
• Income 99999: unexpected precision.
• Age C: categorical value?
• Age 0: error/missing value?

Other issues:
• What are the semantics of the marital status?
• What is the unit of measure for the transaction field?

DWML Spring 2008 3 / 35
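Checks like the ones above are easy to automate. Below is a minimal sketch, assuming pandas is available and the table is loaded into a DataFrame with the column names shown above; the regular expression and thresholds are illustrative assumptions, not part of the slides.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the example table on this slide.
customers = pd.DataFrame({
    "ID": [1001, 1002, 1003, 1004, 1005],
    "Zip": ["10048", "J2S7K7", "90210", "6269", "55101"],
    "Gender": ["M", "F", None, "M", "F"],
    "Income": [75000, -40000, 10000000, 50000, 99999],
    "Age": ["C", "40", "45", "0", "30"],
})

# Zip codes that do not match the 5-digit US format (assumed convention).
bad_zip = customers[~customers["Zip"].str.fullmatch(r"\d{5}")]

# Missing categorical values.
missing_gender = customers[customers["Gender"].isna()]

# Negative or implausibly large incomes (thresholds are illustrative).
bad_income = customers[(customers["Income"] < 0) | (customers["Income"] > 1_000_000)]

# Ages that are not numeric, or numeric but implausible.
age_numeric = pd.to_numeric(customers["Age"], errors="coerce")
bad_age = customers[age_numeric.isna() | (age_numeric <= 0)]

for name, frame in [("zip", bad_zip), ("gender", missing_gender),
                    ("income", bad_income), ("age", bad_age)]:
    print(f"Suspicious {name} values:\n{frame}\n")
```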
Preprocessing
Missing Values
In many real-world databases you will be faced with the problem of missing data:

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    —        —       25               Bad
4    Medium   Medium  —                Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      —       25               Bad
8    Medium   Medium  75               Good

(— = missing)
By simply discarding the records with missing data we might unintentionally bias the data.
DWML Spring 2008 4 / 35
Preprocessing

Missing Values

Possible strategies for handling missing data:

• Use a predefined constant.

• Use the mean (for numerical variables) or the mode (for categorical variables).

• Use a value drawn randomly from the observed distribution.

Applying these strategies to the table above gives:

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    Low      High    25               Bad
4    Medium   Medium  54               Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Medium  25               Bad
8    Medium   Medium  75               Good

• Savings for record 3 is imputed with a mode; note that both Low and Medium are 'modes' for Savings.

• Assets for records 3 and 7 (High and Medium) are drawn randomly from the observed distribution for Assets.

• Income for record 4 is replaced by the mean of the observed incomes:
  54 ≈ (75 + 50 + 25 + 100 + 25 + 25 + 75) / 7.

DWML Spring 2008 5 / 35
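A rough illustration of the three strategies, assuming pandas; the column names follow the example table and everything else (helper structure, random seed) is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Savings": ["Medium", "Low", None, "Medium", "Low", "High", "Low", "Medium"],
    "Assets":  ["High", "Low", None, "Medium", "Medium", "High", None, "Medium"],
    "Income":  [75, 50, 25, None, 100, 25, 25, 75],
})

# Strategy 1: a predefined constant per column.
df_const = df.fillna({"Savings": "Unknown", "Assets": "Unknown", "Income": 0})

# Strategy 2: mean for numerical columns, mode for categorical columns.
df_mean_mode = df.copy()
df_mean_mode["Income"] = df["Income"].fillna(df["Income"].mean())
for col in ["Savings", "Assets"]:
    # mode() may return several values (ties, as for Savings); pick the first.
    df_mean_mode[col] = df[col].fillna(df[col].mode().iloc[0])

# Strategy 3: draw replacements randomly from the observed distribution.
df_random = df.copy()
for col in df.columns:
    observed = df[col].dropna()
    n_missing = df[col].isna().sum()
    draws = observed.sample(n=n_missing, replace=True, random_state=0).to_numpy()
    df_random.loc[df[col].isna(), col] = draws

print(df_mean_mode)
```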
Preprocessing
Discretization
Some data mining algorithms can only handle discrete attributes. Possible solution: Divide the continuous range into intervals. Example:

(Income, Risk) = 〈(25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)〉

Unsupervised discretization

Equal width binning (width 25):

Bin 1: 25, 25              [25, 50)
Bin 2: 50, 51, 54          [50, 75)
Bin 3: 75, 75, 100, 100    [75, 100]

Equal frequency binning (bin density 3):

Bin 1: 25, 25, 50          [25, 50.5)
Bin 2: 51, 54, 75, 75      [50.5, 87.5)
Bin 3: 100, 100            [87.5, 100]
DWML Spring 2008 6 / 35
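A small sketch of the two binning schemes in plain Python; the function names are my own, and the equal-frequency version splits purely by rank, so unlike the slide it does not keep the two tied values 75 in the same bin.

```python
def equal_width_bins(values, width):
    """Assign each value to a fixed-width bin; the top boundary is closed
    so that max(values) falls into the last bin, as in the example above."""
    lo, hi = min(values), max(values)
    last_bin = int((hi - lo - 1e-9) // width)
    return [min(int((v - lo) // width), last_bin) for v in values]

def equal_frequency_bins(values, per_bin):
    """Assign roughly `per_bin` consecutive sorted values to each bin
    (naive rank-based split; tied values may end up in different bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank // per_bin
    return bins

income = [25, 25, 50, 51, 54, 75, 75, 100, 100]
print(equal_width_bins(income, 25))      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(income, 3))   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```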
Preprocessing
Supervised discretization
Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point giving the highest information gain:

Gain(S, v) = Ent(S) − [ (|S≤v| / |S|) · Ent(S≤v) + (|S>v| / |S|) · Ent(S>v) ]

until some stopping criterion is met.

(Income, Risk) = 〈(25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)〉

Ent(S) = −( (3/9) · log2(3/9) + (6/9) · log2(6/9) ) = 0.9183

Split  Expected Ent.  Intervals
25     0.4602         (−∞, 25], (25, ∞)
50     0.7395         (−∞, 50], (50, ∞)
51     0.3606         (−∞, 51], (51, ∞)
54     0.5394         (−∞, 54], (54, ∞)
75     0.7663         (−∞, 75], (75, ∞)

The split at 51 has the lowest expected entropy (and hence the highest gain), so it is selected first.
DWML Spring 2008 7 / 35
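A minimal sketch of the entropy-based split-point selection in plain Python; run on the example data it should reproduce the numbers in the table above.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def expected_entropy(pairs, split):
    """Weighted entropy after splitting (value, label) pairs at `split`."""
    left = [lab for val, lab in pairs if val <= split]
    right = [lab for val, lab in pairs if val > split]
    n = len(pairs)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

data = [(25, 'B'), (25, 'B'), (50, 'G'), (51, 'B'), (54, 'G'),
        (75, 'G'), (75, 'G'), (100, 'G'), (100, 'G')]

print(round(entropy([lab for _, lab in data]), 4))          # 0.9183
for split in [25, 50, 51, 54, 75]:
    print(split, round(expected_entropy(data, split), 4))
best = min([25, 50, 51, 54, 75], key=lambda s: expected_entropy(data, s))
print("best split:", best)                                  # 51
```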
Preprocessing

Data Transformation

Some data mining tools tend to give variables with a large range a higher significance than variables with a smaller range. For example,

• Age versus income.

The typical approach is to standardize the scales:

Min-Max Normalization:

X* = (X − min(X)) / (max(X) − min(X))

[Plot: min-max normalized values of attributes A1 and A2 against the original values.]

Z-score standardization:

X* = (X − mean(X)) / SD(X)

[Plot: standardized values of attributes A1 and A2 against the original values.]

DWML Spring 2008 8 / 35
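A quick sketch of both rescalings in plain Python (standard library only); the example values are made up.

```python
from statistics import mean, pstdev

def min_max_normalize(xs):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score_standardize(xs):
    """Center values at 0 and scale to unit standard deviation."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]

ages = [23, 35, 44, 51, 60]
incomes = [20000, 35000, 54000, 75000, 100000]
print(min_max_normalize(ages))
print(z_score_standardize(incomes))
```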
Preprocessing

Outliers

Data: 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20.

[Histogram of the data values over the range 0 to 20.]

Summary statistics:

• First quartile (1Q): 25% of the data = 4.

• Second quartile (2Q): 50% of the data = 6.

• Third quartile (3Q): 75% of the data = 7.

Interquartile range IQR = 3Q − 1Q = 3.

A data point may be an outlier if:

• It is lower than 1Q − 1.5 · IQR = 4 − 1.5 · 3 = −0.5.

• It is higher than 3Q + 1.5 · IQR = 7 + 1.5 · 3 = 11.5.

Here the value 20 exceeds 11.5 and is therefore flagged as a potential outlier.

DWML Spring 2008 9 / 35
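A small sketch of the IQR rule on the data above, assuming Python 3.8+; note that quartile conventions differ, and the 'inclusive' method is chosen here because it matches the hand-computed quartiles 4, 6, 7.

```python
from statistics import quantiles

data = [1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20]

# 'inclusive' matches the hand computation on the slide: 1Q=4, 2Q=6, 3Q=7.
q1, q2, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)           # 4 6 7 3
print(lower_fence, upper_fence)  # -0.5 11.5
print(outliers)                  # [20]
```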
Data Warehousing and Machine Learning
Clustering
Thomas D. Nielsen
Aalborg University
Department of Computer Science
Spring 2008
Clustering: partitional and hierarchical DWML Spring 2008 10 / 35
Clustering
Unlabeled Data
The Iris data with class labels removed:
Attributes
SL   SW   PL   PW
5.1  3.5  1.4  0.2
4.9  3.0  1.4  0.2
6.3  2.9  6.0  2.1
6.3  2.5  4.9  1.5
...  ...  ...  ...
Unlabeled data in general: (discrete or continuous) attributes, no class variable.
Clustering: partitional and hierarchical DWML Spring 2008 11 / 35
Clustering
Clustering
A clustering of the data S = s1, . . . , sN consists of a set C = {c1, . . . , ck} of cluster labels, and a cluster assignment ca : S → C.

Clustering Iris with C = {blue, red}:

Note: a clustering partitions the data points, not necessarily the instance space. When cluster labels have no particular significance, we can also identify a clustering with the partition S = S1 ∪ . . . ∪ Sk, where Si = ca⁻¹(ci).
Clustering: partitional and hierarchical DWML Spring 2008 12 / 35
Clustering
Clustering goal
[Figure: a candidate clustering (indicated by colors) of data cases in instance space. Arrows indicate selected between-cluster and within-cluster distances.]

General goal: find a clustering with large between-cluster variation (sum of between-cluster distances) and small within-cluster variation (sum of within-cluster distances). The concrete goal varies according to the exact distance definition.
Clustering: partitional and hierarchical DWML Spring 2008 13 / 35
Clustering
Examples
• Group plants/animals into families or related species, based on
  • morphological features
  • molecular features

• Identify types of customers based on attributes in a database (can then be targeted by special advertising campaigns)
• Web mining: group web-pages according to content
Clustering: partitional and hierarchical DWML Spring 2008 14 / 35
Clustering
Clustering vs. Classification
The cluster label can be interpreted as a hidden class variable
• that is never observed
• whose number of states is unknown
• on which the distribution of attribute values depends
Clustering is often called unsupervised learning, vs. the supervised learning of classifiers: in supervised learning, correct class labels for the training data are provided to the learning algorithm by a supervisor, or teacher.

One key problem in clustering is determining the “right” number of clusters. Two different approaches:
• Partition-based clustering
• Hierarchical clustering
All clustering methods require a distance measure on the instance space!
Clustering: partitional and hierarchical DWML Spring 2008 15 / 35
Clustering
Partition-based Clustering
Number k of clusters fixed (user defined). Partition data into k clusters.
k-means clustering
Assume that
• there is a distance function d(s, s′) defined between data items
• we can compute the mean value of a collection {s1, . . . , sl} of data items
Initialize: randomly pick initial cluster centers c = c1, . . . , ck from S
repeat
    for i = 1, . . . , k
        Si := {s ∈ S | ci = arg min_{c ∈ c} d(c, s)}
        c_old,i := ci
        ci := mean Si
        ca(s) := ci   (for s ∈ Si)
until c = c_old
Clustering: partitional and hierarchical DWML Spring 2008 16 / 35
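A compact sketch of the algorithm above, assuming NumPy; function names, the convergence test, and the toy data are my own choices, and initialization simply samples k data points as on the slide.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Basic k-means with Euclidean distance; returns (assignments, centers)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            data[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):   # c = c_old: converged
            break
        centers = new_centers
    return assign, centers

# Toy usage: three well-separated blobs in the plane.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in ([0, 0], [5, 0], [0, 5])])
labels, centers = k_means(data, k=3)
print(centers)
```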
Clustering

Example

k = 3:

[Sequence of plots showing the cluster centers c1, c2, c3 and the induced clusters S1, S2, S3 being updated over successive k-means iterations.]

Clustering: partitional and hierarchical DWML Spring 2008 17 / 35
Clustering
Example (cont.)

Result for clustering the same data with k = 2:

[Plot of the resulting clusters S1, S2 and their centers c1, c2.]
Result can depend on choice of initial cluster centers!
Clustering: partitional and hierarchical DWML Spring 2008 18 / 35
Clustering
Outliers
The result of partitional clustering can be skewed by outliers. Example with k = 2:
Useful preprocessing: outlier detection and elimination (be careful not to eliminate interesting outliers!).
Clustering: partitional and hierarchical DWML Spring 2008 19 / 35
Clustering

k-Means as optimization

With a Euclidean distance function dist we can use the sum of squared errors for evaluating a clustering:

SSE = Σ_{i=1}^{k} Σ_{x ∈ Ci} dist(x, ci)²

k-means directly tries to minimize this error:

Initialize: randomly pick initial cluster centers c = c1, . . . , ck from S
repeat
    for i = 1, . . . , k
        Si := {s ∈ S | ci = arg min_{c ∈ c} d(c, s)}   // minimize the SSE for the current clusters
        c_old,i := ci
        ci := mean Si                                   // the centroid that minimizes the SSE for the assigned objects
        ca(s) := ci   (for s ∈ Si)
until c = c_old

Only guaranteed to find a local minimum.

Clustering: partitional and hierarchical DWML Spring 2008 20 / 35
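For completeness, a tiny sketch of the SSE computation for a given assignment, assuming NumPy and the conventions of the k-means sketch above.

```python
import numpy as np

def sse(data, assign, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    diffs = data - centers[assign]   # center of each point's cluster
    return float(np.sum(diffs ** 2))

# Example: evaluate the clustering produced by the k-means sketch above.
# print(sse(data, labels, centers))
```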
Hierarchical Clustering

Reducing SSE

Choosing initial centroids:

• Perform multiple runs with random initializations.

• Initialize centroids based on results from another algorithm (e.g. hierarchical).

• . . .

Postprocessing:

• Split a cluster.

• Disperse a cluster (choose the one whose removal increases the SSE the least).

• Merge two clusters (the two with the closest centroids, or the two whose merge increases the SSE the least).

Clustering: partitional and hierarchical DWML Spring 2008 21 / 35
Hierarchical Clustering

Hierarchical clustering

The “right” number of clusters may not only be unknown, it may also be quite ambiguous:

Provide an explicit representation of nested clusterings of different granularity.

Clustering: partitional and hierarchical DWML Spring 2008 22 / 35
Hierarchical Clustering
Agglomerative hierarchical clustering
Extend the distance function d(s, s′) to a distance function D(S, S′) between sets of data items. Two out of many possibilities:

Daverage(S, S′) := (1 / (|S| · |S′|)) · Σ_{s ∈ S, s′ ∈ S′} d(s, s′)

Dmin(S, S′) := min_{s ∈ S, s′ ∈ S′} d(s, s′)

for i = 1, . . . , N: Si := {si}
while current partition S1 ∪ . . . ∪ Sk of S contains more than one element
    (i, j) := arg min_{i ≠ j ∈ 1,...,k} D(Si, Sj)
    form new partition by merging Si and Sj.
When Daverage is used, this is also called average link clustering; for Dmin: single link clustering.
Clustering: partitional and hierarchical DWML Spring 2008 23 / 35
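A small sketch of agglomerative clustering with the two linkage functions above, in plain Python; it is purely illustrative and quadratic in the number of data items.

```python
from itertools import combinations
from math import dist   # Euclidean distance between two points (Python 3.8+)

def d_min(A, B):
    """Single-link distance between two clusters (lists of points)."""
    return min(dist(a, b) for a in A for b in B)

def d_average(A, B):
    """Average-link distance between two clusters."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def agglomerative(points, linkage, n_clusters=1):
    """Start with singleton clusters and merge the two closest until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
print(agglomerative(points, d_min, n_clusters=2))
print(agglomerative(points, d_average, n_clusters=2))
```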
Hierarchical Clustering

[Example figures omitted]

Clustering: partitional and hierarchical DWML Spring 2008 24 / 35
Hierarchical Clustering

Dendrogram Representation of Hierarchical Clustering

[Dendrogram: the vertical axis shows the distance of the merged components; horizontal cuts yield, e.g., a 3-clustering and a 5-clustering.]

The length of the distance interval corresponding to a specific clustering can be interpreted as a measure for the significance of this particular clustering.

Clustering: partitional and hierarchical DWML Spring 2008 25 / 35
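For reference, SciPy provides ready-made agglomerative clustering and dendrogram plotting; a minimal sketch, assuming scipy and matplotlib are installed and using made-up data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.4, size=(15, 2)) for c in ([0, 0], [4, 0], [2, 3])])

# method='average' gives average link; method='single' gives single link.
Z = linkage(data, method='average')

# Cut the hierarchy to obtain a 3-clustering.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# The dendrogram's vertical axis is the distance at which clusters are merged.
dendrogram(Z)
plt.ylabel("Distance of merged components")
plt.show()
```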
Hierarchical Clustering

Single link vs. Average link

[Figures: the 4-clustering for single link and average link, the single link 2-clustering, and the average link 2-clustering.]

Generally: single link will produce rather elongated, linear clusters, average link more convex clusters.

Clustering: partitional and hierarchical DWML Spring 2008 26 / 35
Hierarchical Clustering

Another Example

[Figures: a data set, its single link 2-clustering, and an average link 2-clustering (or similar).]

Clustering: partitional and hierarchical DWML Spring 2008 27 / 35
Data Warehousing and Machine Learning
Self Organizing Maps
Thomas D. Nielsen
Aalborg University
Department of Computer Science
Spring 2008
Self Organizing Maps DWML Spring 2008 28 / 35
Self Organizing Maps
SOMs as Special Neural Networks
[Diagram: input layer connected to a two-dimensional output layer.]

• Neural network structure without hidden layers
• Output neurons structured as a two-dimensional array
• Connection from the i-th input to the j-th output has weight wi,j
• No activation function for output nodes
Self Organizing Maps DWML Spring 2008 29 / 35
Self Organizing Maps
Kohonen Learning
Given:  Unlabeled data a1, . . . , aN ∈ R^n
        Distance measure dn(·, ·) on R^n
        Distance measure dout(·, ·) on output neurons
        Update function η(t, d) : N × R → R; decreasing in t and d.

1. Initialize weight vectors wj(0) for output nodes oj
2. t := 0
3. repeat
4.     t := t + 1
5.     for i = 1, . . . , N
6.         let oj be the output neuron minimizing dn(wj, ai)
7.         for all output nodes oh:
8.             wh(t) := wh(t−1) + η(t, dout(oh, oj)) · (ai − wh(t−1))
9. until termination condition applies
Self Organizing Maps DWML Spring 2008 30 / 35
Self Organizing Maps
Distances etc.
Possible choices:
dn:           Euclidean
dout(oj, oh): e.g. 1 if oj, oh are neighbors (rectangular or hexagonal layout), or
              the Euclidean distance on grid indices
η(t, d):      e.g. α(t) · exp(−d²/2σ²(t)) with α(t), σ(t) decreasing in t.
Self Organizing Maps DWML Spring 2008 31 / 35
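A compact sketch of Kohonen learning with the choices above, assuming NumPy; the grid layout and the schedules for α(t) and σ(t) are illustrative assumptions.

```python
import numpy as np

def train_som(data, grid_shape=(4, 4), n_epochs=20, seed=0):
    """Kohonen learning: one weight vector per output node on a 2-D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of every output node, used for d_out.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = rng.normal(size=(rows * cols, data.shape[1]))

    for t in range(1, n_epochs + 1):
        alpha = 0.5 / t                # learning rate α(t), decreasing in t
        sigma = max(0.5, 2.0 / t)      # neighborhood width σ(t), decreasing in t
        for a in data:
            winner = np.argmin(np.linalg.norm(weights - a, axis=1))   # d_n: Euclidean
            d_out = np.linalg.norm(coords - coords[winner], axis=1)   # distance on grid indices
            eta = alpha * np.exp(-d_out**2 / (2 * sigma**2))          # update function η(t, d)
            weights += eta[:, None] * (a - weights)
    return weights, coords

# Toy usage: two-dimensional data drawn from three blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
weights, coords = train_som(data)
print(weights[:4])
```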
Self Organizing Maps
Intuition
SOM learning can be understood as fitting a 2-dimensional surface to the data:
[Figure: output neurons o0,0, o0,1, o1,0, o1,1 of the grid fitted to the data. Colors indicate association with different output neurons, not data attributes. Some output neurons may not have any associated data cases.]
Self Organizing Maps DWML Spring 2008 32 / 35
Self Organizing Maps
Example (from Tan et al.)
Data: Word occurrence data (?) from 3204 articles from the Los Angeles Times with (hidden) section labels Entertainment, Financial, Foreign, Metro, National, Sports.

Result of SOM clustering on a 4 × 4 hexagonal grid:

[Grid of output nodes labelled Sports, Metro, Foreign, Entertainment, National and Financial, shaded from low to high density.]

Output nodes are labelled with the majority label of their associated cases and colored according to the number of cases associated with them (fictional).
Self Organizing Maps DWML Spring 2008 33 / 35
Self Organizing Maps
SOMs and k-means
In spite of their roots in neural networks, SOMs are more closely related to k-means clustering:

• Weight vectors wj are cluster centers

• Kohonen updating associates data cases with cluster centers, and repositions cluster centers to fit associated data cases

• Differences:
  • 2-dim. “spatial” relationship among cluster centers
  • Data cases associated with more than one cluster center
  • On-line updating (one case at a time)
Self Organizing Maps DWML Spring 2008 34 / 35
Self Organizing Maps
Pros and Cons
+ Provides more insight than a basic clustering (i.e. partitioning of data)
+ Can produce intuitive representations of clustering results
- No well-defined objective function that is optimized
Self Organizing Maps DWML Spring 2008 35 / 35