
KDD Overview

Xintao Wu

What is data mining?

Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, images, etc.

Patterns must be valid, novel, potentially useful, and understandable.

Classic data mining tasks

Classification: mining patterns that can classify future (new) data into known classes.

Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.

Clustering: identifying a set of similarity groups in the data.


Classic data mining tasks (contd)

Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.

Deviation detection: discovering the most significant changes in data.

Data visualization

Why is data mining important?

Huge amounts of data: how do we make the best use of them?
Knowledge discovered from data can be used for competitive advantage.
Many interesting things that one wants to find cannot be found using database queries alone, e.g., “find people likely to buy my products”.


Related fields

Data mining is a multi-disciplinary field: machine learning, statistics, databases, information retrieval, visualization, natural language processing, etc.

Association Rule: Basic Concepts

Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items, e.g., 98% of people who purchase tires and auto accessories also get automotive services done.

Rule Measures: Support and Confidence

Find all rules X → Y with minimum support and confidence:
support, s: the probability that a transaction contains {X ∪ Y}
confidence, c: the conditional probability that a transaction containing X also contains Y

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)

(Venn diagram: the set of customers who buy beer, the set of customers who buy diapers, and their overlap, the customers who buy both.)
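To make the two measures concrete, here is a minimal Python sketch (my own illustration, not code from the slides) that computes support and confidence on the transaction table above; it reproduces A → C (50%, 66.6%) and C → A (50%, 100%).

# Support and confidence over the example transactions.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(x, y, db):
    """P(transaction contains Y | it contains X) = supp(X ∪ Y) / supp(X)."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...
print(confidence({"C"}, {"A"}, transactions))  # 1.0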

Applications

Market basket analysis: tell me how I can improve my sales by attaching promotions to “best seller” itemsets.
Marketing: “people who bought this book also bought…”
Fraud detection: a claim for immunizations always comes with a claim for a doctor’s visit on the same day.
Shelf planning: given the “best sellers,” how do I organize my shelves?

Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support.
Every subset of a frequent itemset must also be frequent, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

Use the frequent itemsets to generate association rules.

The Apriori Algorithm

Join step: Ck is generated by joining Lk-1 with itself.

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1 (candidates meeting minimum support, here 2):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2, generated from L1 and counted with a scan of D:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 = {{2 3 5}}; scan D → L3:
itemset   sup
{2 3 5}   2
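Below is a minimal Python sketch of this level-wise procedure (my own illustration, not code from the slides). Run on the four-transaction database above with min_support = 2, it reproduces L1, L2, and L3 = {{2, 3, 5}}.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (candidate generation + pruning)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    while current:
        # Count the support of each candidate in one scan of the database.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, sup in sorted(apriori(db, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)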

Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
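The join and prune steps themselves can be written compactly; the sketch below (illustrative, assuming each itemset is kept as a sorted tuple of items) reproduces exactly this example.

from itertools import combinations

def apriori_gen(Lk_minus_1):
    """Join Lk-1 with itself, then prune candidates with an infrequent subset."""
    prev = sorted(tuple(sorted(s)) for s in Lk_minus_1)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    # Join step: combine itemsets sharing their first k-2 items.
    joined = [a + (b[-1],) for a in prev for b in prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]]
    # Prune step: drop candidates having a (k-1)-subset that is not frequent.
    return [c for c in joined
            if all(tuple(sorted(s)) in prev_set for s in combinations(c, k - 1))]

print(apriori_gen(["abc", "abd", "acd", "ace", "bcd"]))
# [('a', 'b', 'c', 'd')]  ->  C4 = {abcd}; acde is pruned because ade is not in L3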

Criticism to Support and Confidence

Example 1 (Aggarwal & Yu, PODS98): among 5000 students,
3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal.

play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

             basketball   not basketball   sum (row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum (col.)   3000         2000             5000

Criticism to Support and Confidence (Cont.)

We need a measure of dependence or correlation between events.

If corr < 1, A is negatively correlated with B (the occurrence of A discourages B).
If corr > 1, A and B are positively correlated.
P(AB) = P(A)P(B) if the itemsets are independent (corr = 1).
P(B|A)/P(B) is also called the lift of the rule A → B (we want a lift greater than 1).

corr(A, B) = P(AB) / (P(A) P(B)) = P(B|A) / P(B)
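Plugging the basketball/cereal numbers into this formula (a quick illustrative check in Python, not from the slides):

# Lift / correlation for the basketball-cereal example (5000 students).
n = 5000
p_basketball = 3000 / n
p_cereal = 3750 / n
p_both = 2000 / n

corr = p_both / (p_basketball * p_cereal)   # P(AB) / (P(A) P(B))
lift = (p_both / p_basketball) / p_cereal   # P(B|A) / P(B), the same value
print(corr, lift)                           # 0.888... < 1: negative correlation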

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects.
Estimate the accuracy of the model:
The known label of each test sample is compared with the model's classification result.
The accuracy rate is the percentage of test-set samples correctly classified by the model.
The test set must be independent of the training set; otherwise over-fitting will occur.

Classification by Decision Tree Induction

Decision tree: a flow-chart-like tree structure.
An internal node denotes a test on an attribute.
A branch represents an outcome of the test.
Leaf nodes represent class labels or class distributions.

Decision tree generation consists of two phases:
Tree construction: at the start, all training examples are at the root; examples are then partitioned recursively based on selected attributes.
Tree pruning: identify and remove branches that reflect noise or outliers.

Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree.

Some probability...

Entropy

info(S) = - Σi (freq(Ci, S)/|S|) log2 (freq(Ci, S)/|S|)

S: the set of cases; freq(Ci, S): the number of cases in S that belong to class Ci.
Prob(“this case belongs to Ci”) = freq(Ci, S)/|S|

Gain: assume attribute A divides set T into subsets Ti, i = 1, …, m.
infoA(T) = Σi (|Ti|/|T|) info(Ti)
gain(A) = info(T) - infoA(T)

Example

info(T), with 9 Play and 5 Don't:
info(T) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits

Test: outlook

infoOutlook = (5/14)(-(2/5) log2(2/5) - (3/5) log2(3/5))
            + (4/14)(-(4/4) log2(4/4))
            + (5/14)(-(3/5) log2(3/5) - (2/5) log2(2/5))
            = 0.69 bits

gainOutlook = 0.94 - 0.69 = 0.25

Test: windy

infoWindy = (7/14)(-(4/7) log2(4/7) - (3/7) log2(3/7))
          + (7/14)(-(5/7) log2(5/7) - (2/7) log2(2/7))
          = 0.92 bits

gainWindy = 0.94 - 0.92 = 0.02

Outlook is the better test (it has the higher information gain).

Outlook    Temp  Humidity  Windy  Class
sunny      75    70        Y      Play
sunny      80    90        Y      Don't
sunny      85    85        N      Don't
sunny      72    95        N      Don't
sunny      69    70        N      Play
overcast   72    90        Y      Play
overcast   83    78        N      Play
overcast   64    65        Y      Play
overcast   81    75        N      Play
rain       71    80        Y      Don't
rain       65    70        Y      Don't
rain       75    80        Y      Play
rain       68    80        N      Play
rain       70    96        N      Play
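The following sketch (my own, using log base 2 and only the Outlook, Windy, and Class columns of the table above) recomputes both gains:

from math import log2
from collections import Counter

# (outlook, windy, class) triples from the weather table above.
data = [
    ("sunny", "Y", "Play"), ("sunny", "Y", "Don't"), ("sunny", "N", "Don't"),
    ("sunny", "N", "Don't"), ("sunny", "N", "Play"), ("overcast", "Y", "Play"),
    ("overcast", "N", "Play"), ("overcast", "Y", "Play"), ("overcast", "N", "Play"),
    ("rain", "Y", "Don't"), ("rain", "Y", "Don't"), ("rain", "Y", "Play"),
    ("rain", "N", "Play"), ("rain", "N", "Play"),
]

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """info(T) - info_A(T) for the attribute at attr_index."""
    total = info([r[-1] for r in rows])
    values = {r[attr_index] for r in rows}
    split = sum(
        (len(sub) / len(rows)) * info([r[-1] for r in sub])
        for v in values
        for sub in [[r for r in rows if r[attr_index] == v]]
    )
    return total - split

print(round(gain(data, 0), 3))  # Outlook: ~0.247
print(round(gain(data, 1), 3))  # Windy:   ~0.016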

Bayesian Classification: Why?

Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem

Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem.

MAP (maximum a posteriori) hypothesis.

Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost.

Bayes' theorem:
P(h|D) = P(D|h) P(h) / P(D)

MAP hypothesis:
hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h)
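A toy illustration of the MAP rule (the priors and likelihoods below are invented for the example):

# Toy MAP selection over two hypothetical hypotheses.
prior = {"h1": 0.7, "h2": 0.3}        # P(h), hypothetical values
likelihood = {"h1": 0.1, "h2": 0.4}   # P(D|h), hypothetical values

# P(D) is the same for every hypothesis, so it can be ignored in the argmax.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_map)   # 'h2', since 0.4 * 0.3 = 0.12 > 0.1 * 0.7 = 0.07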

Naïve Bayes Classifier (I)

A simplified assumption: attributes are conditionally independent:

Greatly reduces the computation cost: only the class distribution and per-class value counts need to be computed.

P(Cj | V) ∝ P(Cj) ∏i=1..n P(vi | Cj)

Naive Bayesian Classifier (II)

Given a training set, we can compute the probabilities

Outlook      P    N     Humidity   P    N
sunny        2/9  3/5   high       3/9  4/5
overcast     4/9  0     normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N     Windy      P    N
hot          2/9  2/5   true       3/9  3/5
mild         4/9  2/5   false      6/9  2/5
cool         3/9  1/5


Example

E = {outlook = sunny, temp in [64, 70], humidity in [65, 70], windy = y} = {E1, E2, E3, E4}

Pr[Play | E] = (Pr[E1 | Play] × Pr[E2 | Play] × Pr[E3 | Play] × Pr[E4 | Play] × Pr[Play]) / Pr[E]
             = (2/9 × 4/9 × 3/9 × 3/9 × 9/14) / Pr[E] = 0.007 / Pr[E]

(The training data are the same weather table as in the decision tree example above.)

Pr[Don't | E] = (3/5 × 2/5 × 1/5 × 3/5 × 5/14) / Pr[E] = 0.010 / Pr[E]

Normalizing over the two classes: Pr[Play | E] = 41%, Pr[Don't | E] = 59%, so E is classified as Don't.
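A short sketch (my own illustration) that reproduces this calculation, including the normalization that yields 41% versus 59%:

# Naive Bayes posterior for E, using the conditional estimates quoted above.
p_play = {"prior": 9/14, "E1": 2/9, "E2": 4/9, "E3": 3/9, "E4": 3/9}
p_dont = {"prior": 5/14, "E1": 3/5, "E2": 2/5, "E3": 1/5, "E4": 3/5}

def score(p):
    """Unnormalized posterior: P(class) times the product of P(Ei | class)."""
    s = p["prior"]
    for e in ("E1", "E2", "E3", "E4"):
        s *= p[e]
    return s

s_play, s_dont = score(p_play), score(p_dont)
print(round(s_play, 3), round(s_dont, 3))                  # ~0.007 and ~0.010
total = s_play + s_dont                                     # stands in for Pr[E]
print(round(s_play / total, 2), round(s_dont / total, 2))  # ~0.41 and ~0.59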

Bayesian Belief Networks (I)

(Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC).)

The conditional probability table for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9
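As a sketch of how such a CPT is used, the table can be stored as a dictionary keyed by the parents' values (illustrative code, not from the slides):

# CPT for LungCancer (LC) given its parents FamilyHistory (FH) and Smoker (S).
# Keys are (FH, S) truth assignments; values are P(LC = yes | FH, S).
cpt_lc = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

def p_lung_cancer(lc, fh, s):
    """P(LC = lc | FH = fh, S = s); the ~LC row is just the complement."""
    p = cpt_lc[(fh, s)]
    return p if lc else 1.0 - p

print(p_lung_cancer(True, True, True))     # 0.8
print(p_lung_cancer(False, False, True))   # 0.3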

What is Cluster Analysis?

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis: grouping a set of data objects into clusters.

Clustering is unsupervised classification: there are no predefined classes.

Typical applications:
As a stand-alone tool to get insight into the data distribution.
As a preprocessing step for other algorithms.

Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability

Major Clustering Approaches

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.

Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.

Density-based: based on connectivity and density functions.

Grid-based: based on a multiple-level granularity structure.

Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimum: exhaustively enumerate all partitions.
Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen'67): each cluster is represented by the center (mean) of the cluster.
k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when no assignment changes.
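A compact Python sketch of these steps (my own illustration for 2-D points with Euclidean distance; the seeds are simply initialized to the first k points):

import math

def kmeans(points, k, max_iters=100):
    """Plain k-means: assign points to the nearest centroid, recompute, repeat."""
    centroids = points[:k]                      # simple initialization assumption
    for _ in range(max_iters):
        # Step 3: assign each point to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Step 2: recompute centroids as the mean point of each cluster.
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:          # Step 4: stop when nothing changes.
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(pts, 2)
print(centroids)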

The K-Means Clustering Method

Example

(Figure: four scatter plots of the same points on a 10 × 10 grid, showing how k-means reassigns objects to the nearest centroid and recomputes the centroids at each iteration until the assignments stop changing.)

Comments on the K-Means Method

Strengths:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
Applicable only when a mean is defined; what about categorical data?
The number of clusters k must be specified in advance.
Unable to handle noisy data and outliers well.
Not suitable for discovering clusters with non-convex shapes.

Hierarchical Clustering

Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition.

(Figure: objects a, b, c, d, e are merged step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, while divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0.)
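A minimal sketch of the agglomerative (AGNES-style) direction with single-linkage distance, on invented 1-D data purely for illustration:

# Agglomerative clustering (single linkage): start with singleton clusters and
# repeatedly merge the two closest clusters until k clusters remain.
def single_linkage(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

print(single_linkage([1.0, 1.2, 1.1, 8.0, 8.2, 8.3], 2))
# [[1.0, 1.1, 1.2], [8.0, 8.2, 8.3]]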

More on Hierarchical Clustering Methods

Major weaknesses of agglomerative clustering methods:
They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
They can never undo what was done previously.

Integration of hierarchical with distance-based clustering:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
CHAMELEON (1999): hierarchical clustering using dynamic modeling.

Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points.

Major features:
Discovers clusters of arbitrary shape.
Handles noise.
Requires one scan of the data.
Needs density parameters as a termination condition.

Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al. (SIGMOD’99)
DENCLUE: Hinneburg & Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)

Grid-Based Clustering Method

Uses a multi-resolution grid data structure.

Several methods:
STING (a STatistical INformation Grid approach): Wang, Yang and Muntz (1997)
WaveCluster: Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
Self-similar clustering: Barbará & Chen (2000)

Model-Based Clustering Methods

Attempt to optimize the fit between the data and some mathematical model; statistical and AI approaches.

Conceptual clustering:
A form of clustering in machine learning.
Produces a classification scheme for a set of unlabeled objects.
Finds a characteristic description for each concept (class).

COBWEB (Fisher’87):
A popular and simple method of incremental conceptual learning.
Creates a hierarchical clustering in the form of a classification tree.
Each node refers to a concept and contains a probabilistic description of that concept.

COBWEB Clustering Method

A classification tree

Summary

Association rule and frequent itemset mining.
Classification: decision trees, Bayesian networks, SVM, etc.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Other data mining tasks.