dm_part2

7/31/2019 dm_part2

Classification (Supervised Learning)

Classification: classification analysis is the organization of data into given classes. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is then used to classify new objects.
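The learn-then-classify loop above can be sketched with a minimal 1-nearest-neighbour classifier; the toy feature vectors and labels here are made up for illustration, and the "model" is simply the stored training set.

```python
# Minimal sketch of supervised classification: a 1-nearest-neighbour
# "model" built from a labeled training set, used to label new objects.

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(training_set, new_object):
    # predict the class label of the closest training object
    nearest = min(training_set, key=lambda pair: euclidean(pair[0], new_object))
    return nearest[1]

# Training set: (features, known class label) -- hypothetical data
training = [((1.0, 1.0), "Good"), ((1.2, 0.8), "Good"),
            ((5.0, 5.0), "Bad"),  ((4.8, 5.2), "Bad")]

print(classify(training, (1.1, 0.9)))  # prints "Good"
```

Real classifiers build a more compact model than the raw training set, but the train/predict split is the same.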


Decision trees

A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: an example decision tree with internal nodes "Salary < 1M", "Prof = teacher", and "Age < 30", and leaf nodes labeled Good or Bad.]
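The example tree can be re-expressed as nested if-then-else rules. The attributes and thresholds follow the figure; the exact assignment of Good/Bad to each leaf is an illustrative assumption, since the figure's leaf order is ambiguous.

```python
# The example decision tree as if-then-else rules (leaf labels assumed).

def classify_applicant(salary, prof, age):
    if salary < 1_000_000:          # internal node: Salary < 1M
        if prof == "teacher":       # internal node: Prof = teacher
            return "Good"
        return "Bad"
    if age < 30:                    # internal node: Age < 30
        return "Bad"
    return "Good"

print(classify_applicant(500_000, "teacher", 40))  # prints "Good"
```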


Decision tree classifiers

Widely used learning method

Easy to interpret: can be re-represented as if-then-else rules

Approximates functions by piecewise constant regions

Does not require any prior knowledge of the data distribution; works well on noisy data

Has been applied to: classifying medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of repayment


Pros and Cons of decision trees

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data

Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

More information: http://www.stat.wisc.edu/~limt/treeprogs.html


Neural network

Set of nodes connected by directed weighted edges.

[Figure: a basic NN unit with inputs x1, x2, x3, weights w1, w2, w3 and output o; and a more typical NN with input nodes x1, x2, x3, a layer of hidden nodes, and output nodes.]

The basic NN unit computes the sigmoid of the weighted sum of its inputs:

    o = 1 / (1 + e^(-y)),  where  y = sum_{i=1..n} w_i * x_i
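The basic NN unit formula above is short enough to compute directly; the weights and inputs below are arbitrary illustrative values.

```python
import math

# Basic NN unit: output is the sigmoid of the weighted sum of inputs,
# o = 1 / (1 + exp(-sum_i w_i * x_i)).

def nn_unit(weights, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))  # weighted sum y
    return 1.0 / (1.0 + math.exp(-s))                # sigmoid squashing

print(nn_unit([0.5, -0.25, 0.1], [1.0, 2.0, 3.0]))  # sigmoid of 0.3, ≈ 0.574
```

A full network chains such units: the outputs of the hidden nodes become the inputs of the output nodes.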


Clustering or Unsupervised Learning


What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms


Clustering

Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.

Group/cluster existing customers based on time series of payment history, such that similar customers end up in the same cluster.

Key requirement: need a good measure of similarity between instances.

Identify micro-markets and develop policies for each.


Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition

Spatial Data Analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters, or use clustering for other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: identification of areas of similar land use in an earth observation database

Insurance: identifying groups of motor insurance policy holders with a high average claim cost

City-planning: identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: observed earthquake epicenters should be clustered along continental faults


Applications

Customer segmentation, e.g. for targeted marketing
  Group/cluster existing customers based on time series of payment history, such that similar customers end up in the same cluster
  Identify micro-markets and develop policies for each

Collaborative filtering: group customers based on common items purchased

Text clustering

Compression


Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
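The intra- versus inter-class criterion can be checked numerically on a toy example: for a good clustering, the average distance within a cluster should be small relative to the average distance between clusters. The 1-D points below are hypothetical.

```python
from itertools import combinations, product

# Average pairwise distance within a cluster vs. between two clusters.

def dist(a, b):
    return abs(a - b)  # simple 1-D distance for illustration

def avg_intra(cluster):
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def avg_inter(c1, c2):
    pairs = list(product(c1, c2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

c1, c2 = [1.0, 1.5, 2.0], [10.0, 10.5, 11.0]
print(avg_intra(c1), avg_inter(c1, c2))  # intra is much smaller than inter
```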


Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)

There is a separate quality function that measures the goodness of a cluster

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables

Weights should be associated with different variables based on applications and data semantics

It is hard to define "similar enough" or "good enough"


Distance functions

Numeric data: Euclidean and Manhattan distances

Categorical data: encode as 0/1 to indicate presence/absence, then use
  Hamming distance (number of dissimilar positions)
  Jaccard coefficient: number of matching 1s / (number of positions with a 1)
  Data-dependent measures: the similarity of A and B depends on their co-occurrence with C

Combined numeric and categorical data: weighted normalized distance



Major Clustering Approaches (I)

Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Density-based approach:
  Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue


Partitional methods: K-means

Criterion: minimize the sum of squared distances
  between each point and the centroid of its cluster, or
  between each pair of points in the cluster

Algorithm:
  Select an initial partition with K clusters: random, the first K points, or K well-separated points
  Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
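The assign/re-center loop above can be sketched in a few lines; 1-D points keep it short, and the merge/split adjustment step is omitted. The data values are hypothetical.

```python
import random

# Minimal k-means: pick K initial centers, then alternate between
# assigning points to the closest center and recomputing centers as
# cluster means, until the centers stop moving.

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # random initial centers
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                # stabilized
            return clusters, centers
        centers = new_centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
clusters, centers = kmeans(points, 2)
print(sorted(centers))  # roughly [1.0, 9.0]
```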



Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Ability to handle dynamic data

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

Ability to handle high dimensionality

Incorporation of user-specified constraints

Interpretability and usability