
Data Warehousing and Machine Learning: Preprocessing

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

Preprocessing

Before you can start on the actual data mining, the data may require some preprocessing:

• Attributes may be redundant.
• Values may be missing.
• The data may contain outliers.
• The data may not be in a suitable format.
• The values may appear inconsistent.

Garbage in, garbage out.

Data Cleaning

ID    Zip     Gender  Income    Age  Marital status  Transaction amount
1001  10048   M       75000     C    M               5000
1002  J2S7K7  F       -40000    40   W               4000
1003  90210           10000000  45   S               7000
1004  6269    M       50000     0    S               1000
1005  55101   F       99999     30   D               3000

Issues in this small table (see the sketch below):

• Zip: is J2S7K7 a correct zip code? Is the four-digit 6269 a correct zip code?
• Gender: the value for record 1003 is missing.
• Income: -40000 is an error, 10000000 is an error or an outlier, and 99999 has unexpected precision (it may be a sentinel code rather than an actual income).
• Age: C is a categorical value in a numerical field, and 0 is an error or a missing-value code.

Other issues:

• What are the semantics of the marital status?
• What is the unit of measure for the transaction field?
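Checks like these are easy to automate. The following is a minimal pandas sketch, not part of the original slides: the table is re-entered by hand, and the validation rules (five-digit zips, incomes in [0, 1000000], positive numeric ages, 99999 as a suspected sentinel) are illustrative assumptions.

    import pandas as pd

    # The example table above (None marks the missing gender).
    df = pd.DataFrame({
        "ID": [1001, 1002, 1003, 1004, 1005],
        "Zip": ["10048", "J2S7K7", "90210", "6269", "55101"],
        "Gender": ["M", "F", None, "M", "F"],
        "Income": [75000, -40000, 10000000, 50000, 99999],
        "Age": ["C", "40", "45", "0", "30"],
    })

    checks = {
        "malformed zip": ~df["Zip"].str.fullmatch(r"\d{5}"),              # J2S7K7, 6269
        "missing gender": df["Gender"].isna(),                            # record 1003
        "income error/outlier": (df["Income"] < 0) | (df["Income"] > 1_000_000),
        "suspected sentinel income": df["Income"] == 99999,
        "bad age": pd.to_numeric(df["Age"], errors="coerce").fillna(0) <= 0,  # 'C' and 0
    }
    for name, mask in checks.items():
        print(f"{name}: records {df.loc[mask, 'ID'].tolist()}")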

Missing Values

In many real-world databases you will be faced with the problem of missing data:

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3                     25               Bad
4    Medium   Medium                   Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low              25               Bad
8    Medium   Medium  75               Good

By simply discarding the records with missing data we might unintentionally bias the data.

Possible strategies for handling missing data:

• Use a predefined constant.
• Use the mean (for numerical variables) or the mode (for categorical variables).
• Use a value drawn randomly from the observed distribution.

Applied to the table above (and sketched in code below):

• Mode: Savings for record 3 is filled in with Low. Note that both Low and Medium are 'modes' for Savings, so the choice between them is arbitrary.
• Random draw: Assets for records 3 and 7 are filled in with High and Medium, drawn randomly from the observed distribution for Assets.
• Mean: Income for record 4 is filled in with 54, since 54 ≈ (75 + 50 + 25 + 100 + 25 + 25 + 75)/7.
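A compact pandas sketch of the three imputation strategies on this table (not from the slides); taking the first of pandas' sorted modes and rounding the mean are assumptions made so the output reproduces the values above.

    import pandas as pd

    df = pd.DataFrame({
        "Savings": ["Medium", "Low", None, "Medium", "Low", "High", "Low", "Medium"],
        "Assets":  ["High", "Low", None, "Medium", "Medium", "High", None, "Medium"],
        "Income":  [75, 50, 25, None, 100, 25, 25, 75],
    })

    # Mode for a categorical attribute; mode() returns all modes sorted,
    # so the Low/Medium tie for Savings is broken in favor of Low.
    df["Savings"] = df["Savings"].fillna(df["Savings"].mode()[0])

    # Random draw from the observed distribution for Assets.
    observed = df["Assets"].dropna()
    df["Assets"] = df["Assets"].apply(
        lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v)

    # Mean for the numerical Income attribute: 375/7 ≈ 54.
    df["Income"] = df["Income"].fillna(round(df["Income"].mean()))

    print(df)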

Discretization

Some data mining algorithms can only handle discrete attributes. Possible solution: divide the continuous range into intervals (a code sketch follows the example below).

Example:

(Income, Risk) = 〈(25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)〉

Unsupervised discretization

Equal-width binning (width 25):

Bin 1: 25, 25            [25, 50)
Bin 2: 50, 51, 54        [50, 75)
Bin 3: 75, 75, 100, 100  [75, 100]

Equal-frequency binning (bin density 3):

Bin 1: 25, 25, 50        [25, 50.5)
Bin 2: 51, 54, 75, 75    [50.5, 87.5)
Bin 3: 100, 100          [87.5, 100]
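In pandas these two schemes correspond to pd.cut (equal width) and pd.qcut (equal frequency). A small sketch; note that pandas closes intervals on the right by default, so the boundary values land slightly differently than in the hand-worked bins above.

    import pandas as pd

    income = pd.Series([25, 25, 50, 51, 54, 75, 75, 100, 100])

    equal_width = pd.cut(income, bins=3)    # three bins of width 25 over [25, 100]
    equal_freq = pd.qcut(income, q=3)       # three bins with roughly equal counts

    print(equal_width.value_counts().sort_index())   # 3 / 4 / 2 with right-closed bins
    print(equal_freq.value_counts().sort_index())    # 3 / 4 / 2, edges at the tertiles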

Supervised discretization

Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point v giving the highest information gain

Gain(S, v) = Ent(S) − [ (|S≤v| / |S|) · Ent(S≤v) + (|S>v| / |S|) · Ent(S>v) ]

until some stopping criterion is met.

(Income, Risk) = 〈(25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)〉

Ent(S) = −(3/9 · log2(3/9) + 6/9 · log2(6/9)) = 0.9183

The expected entropy E-Ent after each candidate split (the bracketed term above); the best split maximizes the gain by minimizing E-Ent:

Split  E-Ent   Intervals
25     0.4602  (−∞, 25], (25, ∞)
50     0.7395  (−∞, 50], (50, ∞)
51     0.3606  (−∞, 51], (51, ∞)
54     0.5394  (−∞, 54], (54, ∞)
75     0.7663  (−∞, 75], (75, ∞)
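The table can be reproduced with a few lines of standard-library Python; nothing beyond the entropy formula above is assumed.

    import math
    from collections import Counter

    data = [(25, "B"), (25, "B"), (50, "G"), (51, "B"), (54, "G"),
            (75, "G"), (75, "G"), (100, "G"), (100, "G")]

    def entropy(labels):
        """Ent(S) = -sum of p_c * log2(p_c) over the class distribution of S."""
        n = len(labels)
        return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

    labels = [c for _, c in data]
    print(f"Ent(S) = {entropy(labels):.4f}")            # 0.9183

    for v in (25, 50, 51, 54, 75):                      # candidate split points
        left = [c for x, c in data if x <= v]
        right = [c for x, c in data if x > v]
        e_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
        print(f"split {v}: E-Ent = {e_ent:.4f}, Gain = {entropy(labels) - e_ent:.4f}")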

Data Transformation

Some data mining tools tend to give variables with a large range a higher significance than variables with a smaller range. For example: age versus income.

The typical approach is to standardize the scales (sketched in code below):

Min-max normalization:

X∗ = (X − min(X)) / (max(X) − min(X))

[Figure: two attributes A1 and A2 plotted against their min-max normalized values, which all fall in [0, 1].]

Z-score standardization:

X∗ = (X − mean(X)) / SD(X)

[Figure: the same attributes A1 and A2 plotted against their standardized values, centered on 0.]
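Both scalings are one-liners in numpy; the age values below are invented for illustration, and note that numpy's std computes the population standard deviation by default.

    import numpy as np

    age = np.array([23.0, 35.0, 47.0, 52.0, 61.0, 70.0])   # illustrative values

    minmax = (age - age.min()) / (age.max() - age.min())   # into [0, 1]
    zscore = (age - age.mean()) / age.std()                # mean 0, SD 1

    print(minmax.round(2))
    print(zscore.round(2))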

Outliers

Data: 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20.

[Figure: histogram of the data; the value 20 lies far from the rest.]

Summary statistics:

• First quartile (1Q): 25% of the data = 4.
• Second quartile (2Q): 50% of the data = 6.
• Third quartile (3Q): 75% of the data = 7.

Interquartile range: IQR = 3Q − 1Q = 3.

A data point may be an outlier if:

• It is lower than 1Q − 1.5 · IQR = 4 − 1.5 · 3 = −0.5.
• It is higher than 3Q + 1.5 · IQR = 7 + 1.5 · 3 = 11.5.
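The IQR rule in numpy; np.percentile's default linear interpolation happens to reproduce the quartiles above for this data set, though quartile conventions vary in general.

    import numpy as np

    data = np.array([1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, 20])

    q1, q3 = np.percentile(data, [25, 75])                # 4.0 and 7.0
    iqr = q3 - q1                                         # 3.0
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # -0.5 and 11.5

    print("possible outliers:", data[(data < lower) | (data > upper)])   # [20]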

Data Warehousing and Machine Learning: Clustering

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

Clustering: partitional and hierarchical

Clustering

Unlabeled Data

The Iris data with class labels removed:

SL   SW   PL   PW
5.1  3.5  1.4  0.2
4.9  3.0  1.4  0.2
6.3  2.9  6.0  2.1
6.3  2.5  4.9  1.5
...  ...  ...  ...

Unlabeled data in general: (discrete or continuous) attributes, no class variable.

Clustering

A clustering of the data S = s1, ..., sN consists of a set C = {c1, ..., ck} of cluster labels and a cluster assignment ca : S → C.

[Figure: a clustering of the Iris data with C = {blue, red}.]

Note: a clustering partitions the data points, not necessarily the instance space. When the cluster labels have no particular significance, we can also identify a clustering with the partition S = S1 ∪ ... ∪ Sk, where Si = ca⁻¹(ci).

Clustering goal

[Figure: a candidate clustering (indicated by colors) of data cases in instance space; arrows indicate selected between-cluster and within-cluster distances.]

General goal: find a clustering with large between-cluster variation (sum of between-cluster distances) and small within-cluster variation (sum of within-cluster distances). The concrete goal varies according to the exact distance definition.

Examples

• Group plants/animals into families or related species, based on
  • morphological features
  • molecular features
• Identify types of customers based on attributes in a database (these can then be targeted by special advertising campaigns).
• Web mining: group web pages according to content.

Clustering vs. Classification

The cluster label can be interpreted as a hidden class variable

• that is never observed,
• whose number of states is unknown,
• on which the distribution of attribute values depends.

Clustering is often called unsupervised learning, in contrast to the supervised learning of classifiers: in supervised learning, correct class labels for the training data are provided to the learning algorithm by a supervisor, or teacher.

One key problem in clustering is determining the "right" number of clusters. Two different approaches:

• Partition-based clustering
• Hierarchical clustering

All clustering methods require a distance measure on the instance space!

Partition-based Clustering

The number k of clusters is fixed (user defined). Partition the data into k clusters.

k-means clustering

Assume that

• there is a distance function d(s, s′) defined between data items,
• we can compute the mean value of a collection {s1, ..., sl} of data items.

Initialize: randomly pick initial cluster centers c = c1, ..., ck from S
repeat
    for i = 1, ..., k
        Si := {s ∈ S | ci = arg min_{c∈c} d(c, s)}
        c_old,i := ci
        ci := mean(Si)
        ca(s) := ci  (for s ∈ Si)
until c = c_old
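A direct numpy translation of this pseudocode, assuming Euclidean distance; the blob data at the bottom is invented for the usage example, and keeping the old center when a cluster goes empty is an implementation choice the pseudocode leaves open.

    import numpy as np

    def k_means(S, k, rng=None):
        """k-means following the pseudocode above, with Euclidean distance."""
        if rng is None:
            rng = np.random.default_rng(0)
        centers = S[rng.choice(len(S), size=k, replace=False)]  # initial centers from S
        while True:
            # Si := items whose nearest center is ci.
            dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            old = centers.copy()
            # ci := mean(Si); keep the old center if a cluster went empty.
            centers = np.array([S[assign == i].mean(axis=0) if (assign == i).any()
                                else old[i] for i in range(k)])
            if np.allclose(centers, old):   # c = c_old
                return assign, centers

    # Usage: three well-separated 2-d blobs.
    rng = np.random.default_rng(1)
    S = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in (0, 3, 6)])
    assign, centers = k_means(S, k=3)
    print(centers.round(2))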

Example

k = 3:

[Figure sequence: starting from initial centers c1, c2, c3, the data points are alternately assigned to clusters S1, S2, S3 and the centers are recomputed, until the centers no longer move.]

Example (cont.)

Result for clustering the same data with k = 2: [Figure: two clusters S1, S2 with centers c1, c2.]

The result can depend on the choice of initial cluster centers!

Outliers

The result of partitional clustering can be skewed by outliers. [Figure: an example with k = 2 distorted by an outlier.]

Useful preprocessing: outlier detection and elimination (but be careful not to eliminate interesting outliers!).

k-Means as optimization

With a Euclidean distance function dist we can use the sum of squared errors (SSE) for evaluating a clustering:

SSE = Σ_{i=1..k} Σ_{x∈Ci} dist(x, ci)²

k-means directly tries to minimize this error:

Initialize: randomly pick initial cluster centers c = c1, ..., ck from S
repeat
    for i = 1, ..., k
        Si := {s ∈ S | ci = arg min_{c∈c} d(c, s)}   // minimizes the SSE for the current centers
        c_old,i := ci
        ci := mean(Si)   // the centroid minimizes the SSE for the assigned objects
        ca(s) := ci  (for s ∈ Si)
until c = c_old

It is only guaranteed to find a local minimum.
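The SSE itself is a short helper that plugs into the k_means sketch above (same assumed Euclidean setting):

    import numpy as np

    def sse(S, assign, centers):
        """SSE = sum over clusters of squared distances to the cluster center."""
        return sum(np.sum((S[assign == i] - c) ** 2) for i, c in enumerate(centers))

    # Usage with the earlier sketch: sse(S, *k_means(S, k=3))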

Reducing SSE

Choosing initial centroids:

• Perform multiple runs with random initializations.
• Initialize centroids based on results from another algorithm (e.g. hierarchical clustering).
• ...

Postprocessing:

• Split a cluster.
• Disperse a cluster (choose the one whose removal increases the SSE the least).
• Merge two clusters (the two with the closest centroids, or the two whose merge increases the SSE the least).

Hierarchical Clustering

The "right" number of clusters may not only be unknown, it may also be quite ambiguous:

[Figure sequence: the same data set plausibly grouped at several different granularities.]

Hierarchical clustering provides an explicit representation of nested clusterings of different granularity.

Agglomerative hierarchical clustering

Extend the distance function d(s, s′) to a distance function D(S, S′) between sets of data items. Two out of many possibilities:

Daverage(S, S′) := (1 / (|S| · |S′|)) · Σ_{s∈S, s′∈S′} d(s, s′)

Dmin(S, S′) := min_{s∈S, s′∈S′} d(s, s′)

for i = 1, ..., N: Si := {si}
while the current partition S1 ∪ ... ∪ Sk of S contains more than one element
    (i, j) := arg min_{i≠j ∈ 1,...,k} D(Si, Sj)
    form the new partition by merging Si and Sj

When Daverage is used, this is also called average link clustering; with Dmin, single link clustering. A code sketch follows below.
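A direct Python translation of the pseudocode (quadratic per merge, fine for small examples); the 1-d toy data and stopping at a target number of clusters are illustrative choices, not part of the slides.

    def agglomerative(S, D, n_clusters=1):
        """Merge the two closest blocks of the partition until n_clusters remain."""
        partition = [[s] for s in S]                     # Si := {si}
        while len(partition) > n_clusters:
            pairs = [(i, j) for i in range(len(partition))
                     for j in range(i + 1, len(partition))]
            i, j = min(pairs, key=lambda p: D(partition[p[0]], partition[p[1]]))
            partition[i] += partition.pop(j)             # merge Si and Sj
        return partition

    def d(s, t):                                         # distance on 1-d items
        return abs(s - t)

    def d_min(A, B):                                     # single link
        return min(d(a, b) for a in A for b in B)

    def d_avg(A, B):                                     # average link
        return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

    print(agglomerative([1, 2, 3, 10, 11, 12], d_min, n_clusters=2))
    # [[1, 2, 3], [10, 11, 12]]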

[Figure sequence: agglomerative clustering of an example data set, merging the two closest components step by step until a single cluster remains.]

Hierarchical Clustering

Dendrogram Representation of Hierarchical Clustering

Dis

tan

ce o

f m

erg

ed

co

mp

on

en

ts

Clustering: partitional and hierarchical DWML Spring 2008 25 / 35

Page 68: Data Warehousing and Machine Learning - Preprocessingpeople.cs.aau.dk/~tdn/Teaching/DWML08/Slides/dwml08-4.pdf · Before you can start on the actual data mining, the data may require

Hierarchical Clustering

Dendrogram Representation of Hierarchical Clustering

Dis

tan

ce o

f m

erg

ed

co

mp

on

en

ts

3−clustering

5−clustering

The length of the distance interval correponding to a specific clustering can be interpreted as ameasure for the significance of this particular clustering

Clustering: partitional and hierarchical DWML Spring 2008 25 / 35

Single link vs. Average link

[Figures: a 4-clustering for single link and average link, a single link 2-clustering, and an average link 2-clustering of the same data set.]

Generally: single link will produce rather elongated, linear clusters; average link produces more convex clusters.

Another Example

[Figures: another data set, its single link 2-clustering, and its average link 2-clustering (or similar).]

Data Warehousing and Machine Learning: Self Organizing Maps

Thomas D. Nielsen

Aalborg University, Department of Computer Science

Spring 2008

Self Organizing Maps

SOMs as Special Neural Networks

[Figure: an input layer connected to an output layer of neurons arranged in a grid.]

• Neural network structure without hidden layers.
• Output neurons structured as a two-dimensional array.
• The connection from the i-th input to the j-th output has weight wi,j.
• No activation function for the output nodes.

Kohonen Learning

Given: unlabeled data a1, ..., aN ∈ Rⁿ,
a distance measure dn(·, ·) on Rⁿ,
a distance measure dout(·, ·) on the output neurons, and
an update function η(t, d) : N × R → R, decreasing in t and d.

1. Initialize weight vectors wj(0) for the output nodes oj
2. t := 0
3. repeat
4.     t := t + 1
5.     for i = 1, ..., N
6.         let oj be the output neuron minimizing dn(wj, ai)
7.         for all output nodes oh:
8.             wh(t) := wh(t−1) + η(t, dout(oh, oj)) · (ai − wh(t−1))
9. until termination condition applies

Distances etc.

Possible choices:

• dn: Euclidean distance.
• dout(oj, oh): e.g. 1 if oj and oh are neighbors (rectangular or hexagonal layout), or the Euclidean distance on the grid indices.
• η(t, d): e.g. α(t) · exp(−d² / (2σ²(t))) with α(t) and σ(t) decreasing in t.

A sketch of Kohonen learning with these choices follows below.
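A minimal numpy sketch of the algorithm with exactly these choices (Euclidean dn, Euclidean dout on grid indices, Gaussian η); the schedules α(t) = 1/t and σ(t) shrinking with t, the 4 × 4 grid, the epoch count, and the random data are all illustrative assumptions.

    import numpy as np

    def som(data, grid=(4, 4), epochs=30, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        rows, cols = grid
        pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        W = rng.normal(size=(rows * cols, data.shape[1]))   # weight vector per output node

        for t in range(1, epochs + 1):
            alpha = 1.0 / t                                 # learning rate, decreasing in t
            sigma = max(rows, cols) / (2.0 * t)             # neighborhood width, decreasing in t
            for a in data:
                j = np.argmin(np.linalg.norm(W - a, axis=1))    # winning neuron (step 6)
                d_out = np.linalg.norm(pos - pos[j], axis=1)    # grid distance to the winner
                eta = alpha * np.exp(-d_out**2 / (2 * sigma**2))
                W += eta[:, None] * (a - W)                     # update rule (step 8)
        return W

    data = np.random.default_rng(1).normal(size=(200, 3))
    print(som(data).shape)   # (16, 3): a 4 x 4 grid of 3-d weight vectors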

Intuition

SOM learning can be understood as fitting a two-dimensional surface to the data:

[Figure: the grid of output neurons o0,0, o0,1, o1,0, o1,1, ... draped through the data. Colors indicate association with different output neurons, not data attributes. Some output neurons may not have any associated data cases.]

Example (from Tan et al.)

Data: word occurrence data (?) from 3204 articles from the Los Angeles Times, with (hidden) section labels Entertainment, Financial, Foreign, Metro, National, Sports.

Result of SOM clustering on a 4 × 4 hexagonal grid:

[Figure: grid cells labelled Sports, Metro, Foreign, Entertainment, National, and Financial, with articles from the same section tending to map to neighboring cells. Output nodes are labelled with the majority label of their associated cases and colored according to the number of cases associated with them (fictional).]

SOMs and k-means

In spite of their roots in neural networks, SOMs are more closely related to k-means clustering:

• The weight vectors wj are cluster centers.
• Kohonen updating associates data cases with cluster centers and repositions the cluster centers to fit the associated data cases.

Differences:

• A two-dimensional "spatial" relationship among the cluster centers.
• Data cases are associated with more than one cluster center.
• On-line updating (one case at a time).

Pros and Cons

+ Provides more insight than a basic clustering (i.e. a partitioning of the data).
+ Can produce intuitive representations of clustering results.
− No well-defined objective function that is optimized.