DATA MINING - WPIweb.cs.wpi.edu/~cs561/s14/Lectures/W3/DataMining-1.pdf · What Is Data Mining? •...


DATA MINING


Introduction


What Is Data Mining?

• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
  • Knowledge discovery (mining) in databases (KDD)
  • Knowledge extraction
  • Data/pattern analysis
  • Information harvesting
  • Business intelligence

Usage of Data Mining: Real-World Apps I

• Play-by-play information recorded by teams
  • Who is on the court
  • Who shoots
  • Results
• Coaches want to know what works best
  • Plays that work well against a given team
  • Good/bad player matchups

Data → Knowledge

Usage of Data Mining: Real-World Apps I

• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

[Figure: shooting percentage (axis ticks 0 to 60), "Overall" compared with "Starks + Houston + Ward playing"]

Usage of Data Mining: Real-World Apps 2

• Assume players X and Y
• All statistics from previous matches are available
• What we need to know is
  • What factors each player should focus on to win the game
  • What the weaknesses of the other side are
  • …

Usage of Data Mining: Real-World Apps 2

• IBM has data mining tools for analyzing tennis data and converting it into knowledge

Usage of Data Mining: Real-World Apps 3

• Items and customer transactions
  • Which items are bought together
  • When the items are bought
  • Quantities
• Owners need to know
  • Are there peak seasons for specific items?
  • Which items should be put next to each other?
  • If we discount item X, should we also discount other items?

Data → Knowledge

Usage of Data Mining: Real-World Apps 3

[Figure callouts: "One piece of knowledge", "Another piece of knowledge"]

• If you discount A and B, do not discount C

Data Mining: Name

•  The process of discovering meaningful new correlations, patterns, and trends from large amounts of stored data.

• Data Mining
• Knowledge Mining
• Knowledge Discovery in Databases
• Data Archaeology
• Data Dredging
• Database Mining
• Knowledge Extraction
• Data Pattern Processing
• Information Harvesting
• Siftware

Integration of Multiple Technologies

[Diagram: Data Mining at the center, surrounded by the contributing fields]
• Machine Learning
• Database Management
• Artificial Intelligence
• Statistics
• Visualization
• Algorithms

Data Mining: Classification Schemes

• General functionality
  • Descriptive data mining
  • Predictive data mining
• Different views, different classifications
  • Kinds of data to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

Knowledge Discovery in Databases: Process

Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge

Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

What Can Data Mining Do?

• Cluster
• Classify
  • Categorical, Regression
• Summarize
  • Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  • Association rules
• Sequence analysis
  • Time-series analysis, Sequential associations
• Detect Deviations (Outliers)

Clustering

• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g., between travel profiles), while conceptual techniques use background concepts and logical descriptions

Uses:
• Demographic analysis

Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering

Example: “Group people with similar travel profiles”
Clusters:
• George, Patricia
• Jeff, Evelyn, Chris
• Rob

Classification

• Find ways to separate data items into pre-defined groups
  • We know X and Y belong together; find other things in the same group
• Requires “training data”: data items where the group is known

Uses:
• Profiling

Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets

Example: “Route documents to most likely interested parties”
Groups:
• English or non-English?
• Domestic or foreign?

[Diagram: Training Data → tool produces → classifier]

Association Rules

• Identify dependencies in the data:
  • X makes Y likely
• Indicate significance of each dependency
• Bayesian methods

Uses:
• Targeted marketing

Technologies:
• AIS, SETM, Hugin, TETRAD II

Example: “Find groups of items commonly purchased together”
• People who purchase fish are extraordinarily likely to purchase wine
• People who purchase turkey are extraordinarily likely to purchase cranberries

Date   Time   Register  Fish  Turkey  Cranberries  Wine  …
12/6   13:15  2         N     Y       Y            Y     …
12/6   13:16  3         Y     N       N            Y     …

Deviation Detection (Outlier Detection)

• Find unexpected values, outliers

Uses:
• Failure analysis
• Anomaly discovery for analysis

Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization

Example: “Find unusual occurrences in IBM stock prices”

Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time

Architecture: Typical Data Mining System

[Diagram: system components, listed from the data sources up to the user]
• Databases and Data Warehouse
• Data cleaning & data integration, filtering
• Database or data warehouse server
• Data mining engine
• Pattern evaluation
• Graphical user interface
• Knowledge base

What To Cover

• Frequent Itemset Mining

• Association Rule Mining

• Clustering

• Classification

• Deviation (Outlier) Detection


Frequent Itemset Mining


Frequent Itemset Mining

• Very common problem in market-basket applications
• Given a set of items I = {milk, bread, jelly, …}
• Given a set of transactions where each transaction contains a subset of items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}

Frequent Pattern Mining

• Given a set of items I = {milk, bread, jelly, …}
• Given a set of transactions where each transaction contains a subset of items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}

Which itemsets are frequently sold together?

An itemset is frequent if the percentage of transactions in which it appears is >= α
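The threshold test above can be sketched in a few lines of Python. The five transactions are hypothetical (the slides do not list them); they were chosen only so that the percentages agree with the Bread/PeanutButter figures quoted in this deck. The names `support`, `transactions`, and `alpha` are illustrative.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    target = frozenset(itemset)
    hits = sum(1 for t in transactions if target <= t)
    return hits / len(transactions)


# Hypothetical market-basket data (an assumption, not taken from the slides).
transactions = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]

alpha = 0.6  # minimum support threshold
is_frequent = support({"Bread", "PeanutButter"}, transactions) >= alpha
print(support({"Bread"}, transactions), is_frequent)
```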

23

Example

Assume α = 60%. What are the frequent itemsets?

• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%

The percentage is called the “support” of the itemset.

How to find frequent itemsets

• Naïve approach
  • Enumerate all possible itemsets and then count each one
    • All possible itemsets of size 1
    • All possible itemsets of size 2
    • All possible itemsets of size 3
    • All possible itemsets of size 4
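A minimal sketch of the naïve approach, assuming the same kind of hypothetical transaction list as before (not data from the slides). It enumerates every non-empty itemset with `itertools.combinations`, so with m distinct items it counts 2^m - 1 candidates, which is what makes the approach impractical for large item sets.

```python
from itertools import combinations


def naive_frequent_itemsets(transactions, min_support):
    """Enumerate every non-empty itemset and keep those meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    examined = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            examined += 1
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                frequent[cset] = count / n
    return frequent, examined


# Hypothetical data (an assumption), consistent with the percentages above.
transactions = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]

frequent, examined = naive_frequent_itemsets(transactions, 0.6)
for itemset, sup in frequent.items():
    print(sorted(itemset), sup)
print("candidates examined:", examined)
```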

25

Can we optimize?

Assume α = 60%. What are the frequent itemsets?

• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%

The percentage is called the “support” of the itemset.

Property: For an itemset S = {X, Y, Z, …} of size n to be frequent, all its subsets of size n-1 must be frequent as well.

Apriori Algorithm

• Executes in scans (iterations); each scan has two phases
  • Given a list of candidate itemsets of size n, count their appearances and find the frequent ones
  • From the frequent ones, generate candidates of size n+1 (the previous property must hold)
    • All subsets of size n must be frequent for an itemset to be a candidate
• Start the algorithm with n = 1, then repeat

Use the property to reduce the number of itemsets to check.
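The two phases can be sketched as follows. This is an illustrative implementation under simple assumptions (everything fits in memory, support is given as an absolute count `min_count`); the function name `apriori` and the variable names are not from the slides. The sample database is the four-transaction database D used in the deck's "Example 2" slide.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Level-wise frequent itemset mining; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: the candidates of size 1 are all individual items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Phase 1: count each candidate and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Phase 2: join frequent k-itemsets into (k+1)-candidates, then prune
        # every candidate that has an infrequent subset of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        k += 1
    return frequent


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Pruning happens before counting, so candidates such as {1 2 3} and {1 3 5} never trigger a database scan: each has an infrequent 2-subset.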

27

Apriori Example

28

Apriori Example (Cont’d)

29

The Apriori Algorithm — Example 2

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1 (itemset: sup): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (frequent, i.e. support count of at least 2): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2

Apriori with Constraints

• If we have constraints, e.g., Sum(price) of the frequent group should not exceed X
• Lazy approach
  • Apply the constraints at the end, on the discovered patterns
• Eager approach
  • Push the constraints into the computation (if possible)

Lazy Algorithm: Apriori + Constraint

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1 (itemset: sup): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2

Constraint: Sum(S.price) < 5
Assume price = ItemID

Eager Algorithm (Crossed entries are not computed)

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1 (itemset: sup): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2

Constraint: Sum(S.price) < 5
Assume price = ItemID
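The difference between the lazy and eager approaches can be illustrated with a small sketch on database D, using the constraint Sum(S.price) < 5 with price = ItemID. The function `mine` and its `keep` parameter are hypothetical names introduced for this illustration. Pushing the constraint into the computation is only safe here because the constraint is anti-monotone: with positive prices, once an itemset violates it, every superset violates it too.

```python
from itertools import combinations


def mine(transactions, min_count, keep=lambda itemset: True):
    """Apriori-style mining; `keep` is checked before a candidate is counted.

    Returns (frequent itemsets with counts, number of candidates counted).
    """
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, counted, k = {}, 0, 1
    while candidates:
        candidates = {c for c in candidates if keep(c)}  # eager pruning
        counted += len(candidates)
        level = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent, counted


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def cheap(itemset):
    """Constraint Sum(S.price) < 5, assuming price = ItemID."""
    return sum(itemset) < 5


# Lazy: mine everything, then filter the discovered patterns.
lazy_all, lazy_counted = mine(D, 2)
lazy = {s: n for s, n in lazy_all.items() if cheap(s)}

# Eager: drop violating candidates before they are counted.
eager, eager_counted = mine(D, 2, keep=cheap)

print(sorted(sorted(s) for s in eager))
print("candidates counted:", lazy_counted, "(lazy) vs", eager_counted, "(eager)")
```

Both approaches return the same itemsets; the eager run simply counts fewer candidates against the database.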

33

Apriori Adv/Disadv

• Advantages:
  • Uses the large itemset property
  • Easily parallelized
  • Easy to implement
• Disadvantages:
  • Assumes the transaction database is memory resident
  • Requires up to m database scans (m is the number of items)

Association Rule Mining

35

Association Rules Outline

• Finding associations between the objects in the database
• When X happens, Y also happens with probability …
  • If the probability is high, then the association between X and Y is important
• What is the probability that a customer who buys bread in a transaction also buys milk in the same transaction?

Example: Market Basket Data

• Items frequently purchased together: Bread ⇒ PeanutButter
• Uses:
  • Placement
  • Advertising
  • Sales
  • Coupons
• Objective: increase sales and reduce costs

Association Rule Definitions

• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions that contain that itemset
• Large (frequent) itemset: itemset whose number of occurrences is above a threshold

Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%

39

Association Rule Definitions

• Association rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
• Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
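Written as formulas, with D the set of transactions, the two definitions read as follows (the operator names are notation chosen here, not the slides'):

```latex
\[
\operatorname{support}(X \Rightarrow Y)
  = \frac{\lvert \{\, t \in D : X \cup Y \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{confidence}(X \Rightarrow Y)
  = \frac{\lvert \{\, t \in D : X \cup Y \subseteq t \,\} \rvert}
         {\lvert \{\, t \in D : X \subseteq t \,\} \rvert}
  = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)}
\]
```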

40

Association Rules Ex (cont’d)

41

Association Rule Problem

• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• Link Analysis
• NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.

Association Rule Techniques

1. Find large itemsets.
2. Generate rules from frequent itemsets.
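Step 2 can be sketched as below, assuming step 1 has already produced a table of frequent itemsets with their supports. The function name `generate_rules` and the tuple layout are illustrative choices; the support values are the Bread/PeanutButter percentages quoted in this deck. Every non-empty proper subset of a frequent itemset is tried as the left-hand side of a rule.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence.

    `supports` maps each frequent itemset (a frozenset) to its support. By the
    subset property it also contains every non-empty subset of each itemset.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty left and right side
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                x = frozenset(lhs)
                conf = sup / supports[x]
                if conf >= min_confidence:
                    rules.append((x, itemset - x, sup, conf))
    return rules


supports = {
    frozenset({"Bread"}): 0.8,
    frozenset({"PeanutButter"}): 0.6,
    frozenset({"Bread", "PeanutButter"}): 0.6,
}

rules = generate_rules(supports, min_confidence=0.7)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.0%} confidence={conf:.0%}")
```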

43

Example

44

Rule: Bread → PeanutButter
• Support of rule = support(Bread, PeanutButter) = 60%
• Confidence of rule = support(Bread, PeanutButter) / support(Bread) = 75%

Rule: Bread, Jelly → PeanutButter
• Support of rule = support(Bread, Jelly, PeanutButter) = 20%
• Confidence of rule = support(Bread, Jelly, PeanutButter) / support(Bread, Jelly) = 100%

Usually we search for rules with Support > α and Confidence > β
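The arithmetic for both rules can be checked with a short sketch. The transaction list is an assumption: the slides do not list the transactions, so five hypothetical baskets are used that reproduce the quoted percentages. `count` and `rule_stats` are illustrative helper names.

```python
def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rule_stats(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    both = count(set(lhs) | set(rhs), transactions)
    support = both / len(transactions)
    confidence = both / count(lhs, transactions)
    return support, confidence


# Hypothetical baskets (an assumption), chosen to match the slide's numbers.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

s1, c1 = rule_stats({"Bread"}, {"PeanutButter"}, transactions)
s2, c2 = rule_stats({"Bread", "Jelly"}, {"PeanutButter"}, transactions)
print(f"Bread -> PeanutButter: support={s1:.0%}, confidence={c1:.0%}")
print(f"Bread, Jelly -> PeanutButter: support={s2:.0%}, confidence={c2:.0%}")
```

The second rule shows why both thresholds matter: its confidence is 100%, but it is backed by a single transaction.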
