dm_part2

7/31/2019 dm_part2

Classification (Supervised Learning)

Classification: classification analysis is the organization of data into given classes. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is then used to classify new objects.
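The learn-then-classify loop above can be sketched with a minimal 1-nearest-neighbour classifier; the toy feature vectors and labels here are made up for illustration, and the "model" is simply the stored training set.

```python
# Minimal sketch of supervised classification: a 1-nearest-neighbour
# "model" built from a labeled training set, used to label new objects.

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(training_set, new_object):
    # predict the class label of the closest training object
    nearest = min(training_set, key=lambda pair: euclidean(pair[0], new_object))
    return nearest[1]

# Training set: (features, known class label) -- hypothetical data
training = [((1.0, 1.0), "Good"), ((1.2, 0.8), "Good"),
            ((5.0, 5.0), "Bad"),  ((4.8, 5.2), "Bad")]

print(classify(training, (1.1, 0.9)))  # prints "Good"
```

Real classifiers build a more compact model than the raw training set, but the train/predict split is the same.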


Decision trees

A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: an example decision tree with internal nodes "Salary < 1M", "Prof = teacher", and "Age < 30", and leaf nodes labeled Good or Bad.]
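The example tree can be re-expressed as nested if-then-else rules. The attributes and thresholds follow the figure; the exact assignment of Good/Bad to each leaf is an illustrative assumption, since the figure's leaf order is ambiguous.

```python
# The example decision tree as if-then-else rules (leaf labels assumed).

def classify_applicant(salary, prof, age):
    if salary < 1_000_000:          # internal node: Salary < 1M
        if prof == "teacher":       # internal node: Prof = teacher
            return "Good"
        return "Bad"
    if age < 30:                    # internal node: Age < 30
        return "Bad"
    return "Good"

print(classify_applicant(500_000, "teacher", 40))  # prints "Good"
```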


Decision tree classifiers

Widely used learning method

Easy to interpret: can be re-represented as if-then-else rules

Approximates functions by piecewise constant regions

Does not require any prior knowledge of the data distribution; works well on noisy data

Has been applied to: classifying medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of repayment


Pros and Cons of decision trees

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data

Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

More information: http://www.stat.wisc.edu/~limt/treeprogs.html


Neural network

Set of nodes connected by directed weighted edges.

[Figure: a basic NN unit with inputs x1, x2, x3, weights w1, w2, w3 and output o; and a more typical NN with input nodes x1, x2, x3, a layer of hidden nodes, and output nodes.]

The basic NN unit computes the sigmoid of the weighted sum of its inputs:

    o = 1 / (1 + e^(-y)),  where  y = sum_{i=1..n} w_i * x_i
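The basic NN unit formula above is short enough to compute directly; the weights and inputs below are arbitrary illustrative values.

```python
import math

# Basic NN unit: output is the sigmoid of the weighted sum of inputs,
# o = 1 / (1 + exp(-sum_i w_i * x_i)).

def nn_unit(weights, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))  # weighted sum y
    return 1.0 / (1.0 + math.exp(-s))                # sigmoid squashing

print(nn_unit([0.5, -0.25, 0.1], [1.0, 2.0, 3.0]))  # sigmoid of 0.3, ≈ 0.574
```

A full network chains such units: the outputs of the hidden nodes become the inputs of the output nodes.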


Clustering or Unsupervised Learning


What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms


Clustering

Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.

Group/cluster existing customers based on time series of payment history, such that similar customers end up in the same cluster.

Key requirement: need a good measure of similarity between instances.

Identify micro-markets and develop policies for each.


Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition

Spatial Data Analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters, or use clustering for other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: identification of areas of similar land use in an earth observation database

Insurance: identifying groups of motor insurance policy holders with a high average claim cost

City-planning: identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: observed earthquake epicenters should be clustered along continental faults


Applications

Customer segmentation, e.g. for targeted marketing
  Group/cluster existing customers based on time series of payment history, such that similar customers end up in the same cluster
  Identify micro-markets and develop policies for each

Collaborative filtering: group customers based on common items purchased

Text clustering

Compression


Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
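The intra- versus inter-class criterion can be checked numerically on a toy example: for a good clustering, the average distance within a cluster should be small relative to the average distance between clusters. The 1-D points below are hypothetical.

```python
from itertools import combinations, product

# Average pairwise distance within a cluster vs. between two clusters.

def dist(a, b):
    return abs(a - b)  # simple 1-D distance for illustration

def avg_intra(cluster):
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def avg_inter(c1, c2):
    pairs = list(product(c1, c2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

c1, c2 = [1.0, 1.5, 2.0], [10.0, 10.5, 11.0]
print(avg_intra(c1), avg_inter(c1, c2))  # intra is much smaller than inter
```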


Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)

There is a separate quality function that measures the goodness of a cluster

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables

Weights should be associated with different variables based on applications and data semantics

It is hard to define "similar enough" or "good enough"


Distance functions

Numeric data: Euclidean and Manhattan distances

Categorical data: encode as 0/1 to indicate presence/absence, then use
  Hamming distance (number of dissimilar positions)
  Jaccard coefficient: number of matching 1s / (number of positions with a 1)
  Data-dependent measures: the similarity of A and B depends on their co-occurrence with C

Combined numeric and categorical data: weighted normalized distance



Major Clustering Approaches (I)

Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Density-based approach:
  Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue


Partitional methods: K-means

Criterion: minimize the sum of squared distances
  between each point and the centroid of its cluster, or
  between each pair of points in the cluster

Algorithm:
  Select an initial partition with K clusters: random, the first K points, or K well-separated points
  Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
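The assign/re-center loop above can be sketched in a few lines; 1-D points keep it short, and the merge/split adjustment step is omitted. The data values are hypothetical.

```python
import random

# Minimal k-means: pick K initial centers, then alternate between
# assigning points to the closest center and recomputing centers as
# cluster means, until the centers stop moving.

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # random initial centers
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                # stabilized
            return clusters, centers
        centers = new_centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
clusters, centers = kmeans(points, 2)
print(sorted(centers))  # roughly [1.0, 9.0]
```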



Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Ability to handle dynamic data

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

Ability to handle high dimensionality

Incorporation of user-specified constraints

Interpretability and usability