7/31/2019 dm_part2
Classification (Supervised Learning)
Classification: classification analysis is the organization of data into given classes. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.
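The train-then-classify loop can be sketched in a few lines; this is a minimal illustration using a made-up toy dataset and a simple nearest-centroid rule, not any specific algorithm from these slides.

```python
# Minimal sketch of supervised classification: learn a model from a
# labeled training set, then use it to classify new objects.
# The 2-D points and class labels below are made up for illustration.

def train(training_set):
    """Build a model: the mean (centroid) of each class's points."""
    sums, counts = {}, {}
    for (x, y), label in training_set:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def classify(model, point):
    """Assign a new object to the class with the nearest centroid."""
    x, y = point
    return min(model, key=lambda lbl: (model[lbl][0] - x) ** 2
                                      + (model[lbl][1] - y) ** 2)

training = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
model = train(training)
print(classify(model, (1.5, 1.2)))  # A: near class A's centroid
print(classify(model, (8.5, 8.5)))  # B: near class B's centroid
```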
Decision trees

Tree where internal nodes are simple decision rules on one or more attributes, and leaf nodes are predicted class labels.

Example:
Salary < 1M?
  yes: Prof = teacher?
         yes: Good
         no:  Bad
  no:  Age < 30?
         yes: Bad
         no:  Good
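A decision tree like the slide's example is just nested if-then-else rules; this sketch hand-codes one such tree (the record format and exact branch layout are assumptions for illustration, since the slide's tree diagram did not survive extraction cleanly).

```python
# A small decision tree written directly as if-then-else rules:
# internal nodes test an attribute, leaves return a class label.

def predict(record):
    if record["salary"] < 1_000_000:        # Salary < 1M?
        if record["prof"] == "teacher":     # Prof = teacher?
            return "Good"
        return "Bad"
    if record["age"] < 30:                  # Age < 30?
        return "Bad"
    return "Good"

print(predict({"salary": 500_000, "prof": "teacher", "age": 40}))   # Good
print(predict({"salary": 2_000_000, "prof": "doctor", "age": 25}))  # Bad
```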
Decision tree classifiers

Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Approximates functions by piecewise constant regions
Does not require any prior knowledge of the data distribution; works well on noisy data
Has been applied to: classifying medical patients by disease, equipment malfunctions by cause, loan applicants by likelihood of repayment
Pros and cons of decision trees

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data

Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

More information: http://www.stat.wisc.edu/~limt/treeprogs.html
Neural network

Set of nodes connected by directed weighted edges.

Basic NN unit: inputs x1, x2, x3 with weights w1, w2, w3 produce the output

  o(x) = 1 / (1 + e^(-sum_i w_i x_i))

i.e. the sigmoid of the weighted input sum.

A more typical NN: input nodes x1, x2, x3 feed hidden nodes, which feed output nodes.
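The basic unit above is a one-liner; this sketch computes the sigmoid of the weighted input sum, with made-up example weights and inputs.

```python
import math

# Basic NN unit: output is the sigmoid of the weighted input sum,
# o = 1 / (1 + e^(-sum(w_i * x_i))).

def unit_output(weights, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

print(unit_output([0.0, 0.0, 0.0], [1, 2, 3]))  # 0.5: zero net input
print(unit_output([1.0], [100.0]))              # close to 1: large net input
```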
Clustering or Unsupervised Learning
What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms
Clustering

Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history, such that similar customers are in the same cluster.
Key requirement: need a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
Clustering: Rich Applications and Multidisciplinary Efforts

Pattern recognition
Spatial data analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters, or use clustering for other spatial mining tasks
Image processing
Economic science (especially market research)
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Applications

Customer segmentation, e.g. for targeted marketing
  Group/cluster existing customers based on time series of payment history, such that similar customers are in the same cluster
  Identify micro-markets and develop policies for each
Collaborative filtering: group based on common items purchased
Text clustering
Compression
Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
There is a separate quality function that measures the goodness of a cluster
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables based on the application and data semantics
It is hard to define "similar enough" or "good enough"
Distance functions

Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
  Hamming distance (# of dissimilar positions)
  Jaccard coefficient: # of shared 1s / # of positions with a 1 in either object
  Data-dependent measures: similarity of A and B depends on co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
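The distance measures listed above can each be sketched in a few lines; this illustration assumes plain Python lists for numeric vectors and 0/1 lists for categorical presence/absence vectors.

```python
import math

# Sketches of common distance/similarity measures for clustering.

def euclidean(a, b):
    """Straight-line distance between numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """# of positions where the 0/1 vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def jaccard(a, b):
    """# of shared 1s over # of positions with a 1 in either vector."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0

print(euclidean([0, 0], [3, 4]))            # 5.0
print(manhattan([0, 0], [3, 4]))            # 7
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```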
Major Clustering Approaches (I)

Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
  Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue
Partitional methods: K-means

Criterion: minimize the sum of squared distances
  between each point and the centroid of its cluster, or
  between each pair of points in the cluster
Algorithm:
  Select an initial partition with K clusters: random, first K, or K well-separated points
  Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
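The k-means loop can be sketched directly from the algorithm outline above; this minimal illustration uses 1-D toy data, the "first K" initialization, and omits the merge/split adjustment step.

```python
# Minimal k-means sketch: pick K initial centers, then repeat
# (assign each point to the closest center, recompute centers as
# cluster means) until the centers stabilize. 1-D toy data.

def kmeans(points, k, iters=100):
    centers = points[:k]                      # "first K" initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centers[i]   # update step
               for i, c in enumerate(clusters)]
        if new == centers:                    # stabilized
            break
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], k=2)
print(sorted(centers))  # one center near the low group, one near the high group
```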
Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge todetermine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability