DATA MINING 1

DATA MINING - WPIweb.cs.wpi.edu/~cs561/s14/Lectures/W3/DataMining-1.pdf · What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial,


DATA MINING


Introduction


What Is Data Mining?

• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
  • Knowledge discovery (mining) in databases (KDD)
  • Knowledge extraction
  • Data/pattern analysis
  • Information harvesting
  • Business intelligence


Usage of Data Mining: Real-World Apps 1

• Play-by-play information recorded by teams
  • Who is on the court
  • Who shoots
  • Results

• Coaches want to know what works best
  • Plays that work well against a given team
  • Good/bad player matchups

[Figure: Data → Knowledge]


Usage of Data Mining: Real-World Apps 1

• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

[Figure: chart of shooting percentage (axis 0 to 60): overall vs. Starks + Houston + Ward playing]


Usage of Data Mining: Real-World Apps 2

• Assume players X and Y

• All statistics from previous matches are available

• What we need to know is
  • What factors should each player focus on to win the game
  • What are the weaknesses of the other side
  • …


Usage of Data Mining: Real-World Apps 2

• IBM has data mining tools for analyzing tennis data and converting it into knowledge


Usage of Data Mining: Real-World Apps 3

• Items and customer transactions
  • Which items are bought together
  • When the items are bought
  • Quantities

• Owners need to know
  • Are there peak seasons for specific items?
  • Which items should be put next to each other?
  • If we discount item X, should we discount other items?

[Figure: Data → Knowledge]


Usage of Data Mining: Real-World Apps 3

[Figure labels: one piece of knowledge; another piece of knowledge]

• If you put a discount on A and B, do not put one on C


Data Mining: Name

• The process of discovering meaningful new correlations, patterns, and trends from large amounts of stored data.

• Names used for the field:
  • Data Mining
  • Knowledge Mining
  • Knowledge Discovery in Databases
  • Data Archaeology
  • Data Dredging
  • Database Mining
  • Knowledge Extraction
  • Data Pattern Processing
  • Information Harvesting
  • Siftware


Integration of Multiple Technologies

[Figure: Data Mining drawing on the following fields]
• Machine Learning
• Database Management
• Artificial Intelligence
• Statistics
• Visualization
• Algorithms


Data Mining: Classification Schemes

• General functionality
  • Descriptive data mining
  • Predictive data mining

• Different views, different classifications
  • Kinds of data to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted


Knowledge Discovery in Databases: Process

[Figure: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]

Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press


What Can Data Mining Do?

• Cluster
• Classify
  • Categorical, Regression
• Summarize
  • Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  • Association rules
• Sequence analysis
  • Time-series analysis, Sequential associations
• Detect Deviations (Outliers)


Clustering

• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions

Uses:
• Demographic analysis

Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering

Example: “Group people with similar travel profiles”

Clusters:
• George, Patricia
• Jeff, Evelyn, Chris
• Rob


Classification

• Find ways to separate data items into pre-defined groups
  • We know X and Y belong together, find other things in same group
• Requires “training data”: Data items where group is known

Uses:
• Profiling

Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets

Example: “Route documents to most likely interested parties”
• English or non-English?
• Domestic or Foreign?

[Figure: Training Data → (tool produces) → classifier; Groups]


Association Rules

• Identify dependencies in the data:
  • X makes Y likely
• Indicate significance of each dependency
• Bayesian methods

Uses:
• Targeted marketing

Technologies:
• AIS, SETM, Hugin, TETRAD II

Example: “Find groups of items commonly purchased together”
• People who purchase fish are extraordinarily likely to purchase wine
• People who purchase turkey are extraordinarily likely to purchase cranberries

Date/Time    Register  Fish  Turkey  Cranberries  Wine  …
12/6 13:15   2         N     Y       Y            Y     …
12/6 13:16   3         Y     N       N            Y     …


Deviation Detection (Outlier Detection)

• Find unexpected values, outliers

Uses:
• Failure analysis
• Anomaly discovery for analysis

Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization

Example: “Find unusual occurrences in IBM stock prices”

Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time


Architecture: Typical Data Mining System

[Figure: layered architecture, listed from the user-facing layer down to the data sources]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration; Filtering
• Databases; Data Warehouse
• Knowledge-base


What To Cover

• Frequent Itemset Mining

• Association Rule Mining

• Clustering

• Classification

• Deviation (Outlier) Detection


Frequent Itemset Mining


Frequent Itemset Mining

• Very common problem in Market-Basket applications

• Given a set of items I = {milk, bread, jelly, …}

• Given a set of transactions where each transaction contains a subset of items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}


Frequent Pattern Mining

• Given a set of items I = {milk, bread, jelly, …}
• Given a set of transactions where each transaction contains a subset of items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}

What are the itemsets frequently sold together?

Frequently sold together: % of transactions in which the itemset appears >= α


Example

Assume α = 60%. What are the frequent itemsets?

• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%

The percentage for each itemset is called its “Support”.
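The support values above can be checked with a few lines of Python. This is only a sketch: the slide's transaction table is not part of the text, so the five transactions below are an assumed table, chosen so that the computed supports match the stated 80% / 60% / 60%. The function name `support` is also illustrative.

```python
# Assumed transaction table (not in the text); chosen so the supports
# match the percentages quoted in the example.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

alpha = 0.60  # minimum support threshold from the example

for candidate in [{"Bread"}, {"PeanutButter"}, {"Bread", "PeanutButter"}, {"Milk"}]:
    s = support(candidate, transactions)
    label = "frequent" if s >= alpha else "not frequent"
    print(sorted(candidate), f"{s:.0%}", label)
```

Under this assumed table, {Milk} appears in 2 of 5 transactions (40%) and so falls below α.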


How to find frequent itemsets

• Naïve Approach
  • Enumerate all possible itemsets and then count each one

[Figure: all possible itemsets of size 1, size 2, size 3, and size 4]
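A minimal sketch of the naïve approach, to make its cost visible: with m distinct items there are 2^m - 1 non-empty itemsets, and each one is counted against every transaction. The toy transactions and the function names are hypothetical, not taken from the slides.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every non-empty subset of `items`, smallest sizes first."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def naive_frequent_itemsets(transactions, min_support):
    """Count every possible itemset; keep those with support >= min_support."""
    items = set().union(*transactions)
    n = len(transactions)
    frequent = {}
    for itemset in all_itemsets(items):
        count = sum(itemset <= t for t in transactions)
        if count / n >= min_support:
            frequent[itemset] = count / n
    return frequent

# Hypothetical toy data (item names borrowed from the earlier slides).
transactions = [
    {"milk", "bread", "water"},
    {"milk", "nuts", "butter", "rice"},
    {"milk", "bread"},
]
items = set().union(*transactions)
print(len(list(all_itemsets(items))))  # 6 items -> 2**6 - 1 = 63 itemsets
print(naive_frequent_itemsets(transactions, min_support=0.6))
```

Even for this tiny example, 63 itemsets are counted to find 3 frequent ones, which is the motivation for pruning.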


Can we optimize?

Assume α = 60%. What are the frequent itemsets?

• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%

The percentage for each itemset is called its “Support”.

Property: For an itemset S = {X, Y, Z, …} of size n to be frequent, all its subsets of size n-1 must be frequent as well.
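The property gives a cheap test that can rule out an itemset before it is ever counted. A sketch, assuming the frequent itemsets of size n-1 are already known; the helper name `may_be_frequent` is illustrative, and "Milk" stands in for any item that is not in the frequent set.

```python
from itertools import combinations

def may_be_frequent(candidate, frequent_smaller):
    """Return False if some (n-1)-subset of `candidate` is not known to be frequent."""
    candidate = frozenset(candidate)
    return all(
        frozenset(sub) in frequent_smaller
        for sub in combinations(candidate, len(candidate) - 1)
    )

# Frequent 1-itemsets from the example (α = 60%).
frequent_1 = {frozenset({"Bread"}), frozenset({"PeanutButter"})}

print(may_be_frequent({"Bread", "PeanutButter"}, frequent_1))  # True
print(may_be_frequent({"Bread", "Milk"}, frequent_1))          # False
```

A `True` result only means the itemset is still a candidate; its support must still be counted.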


Apriori Algorithm

• Executes in scans (iterations); each scan has two phases
  • Given a list of candidate itemsets of size n, count their appearances and find the frequent ones
  • From the frequent ones, generate candidates of size n+1 (the previous property must hold)
    • All subsets of size n must be frequent for an itemset to be a candidate

• Start the algorithm with n = 1, then repeat

Use the property to reduce the number of itemsets to check
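The second phase (candidate generation) can be sketched as a join followed by a prune. This is one straightforward way to write it, not necessarily how any particular implementation does it; `generate_candidates` and the sample frequent 2-itemsets are illustrative.

```python
from itertools import combinations

def generate_candidates(frequent_n):
    """Build size-(n+1) candidates from a collection of frequent size-n itemsets."""
    frequent_n = set(frequent_n)
    candidates = set()
    for a in frequent_n:
        for b in frequent_n:
            union = a | b
            # Join: keep unions that are exactly one item larger.
            if len(union) != len(a) + 1:
                continue
            # Prune: every size-n subset of the union must itself be frequent.
            if all(frozenset(s) in frequent_n for s in combinations(union, len(a))):
                candidates.add(union)
    return candidates

# Illustrative frequent 2-itemsets over integer item IDs.
frequent_2 = {frozenset(p) for p in [(1, 3), (2, 3), (2, 5), (3, 5)]}
print(generate_candidates(frequent_2))  # {frozenset({2, 3, 5})}
```

Here {1 2 3} and {1 3 5} are never counted, because {1 2} and {1 5} are not frequent.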


Apriori Example


Apriori Example (Cont’d)


The Apriori Algorithm: Example 2

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)

itemset
{2 3 5}

Scan D → L3

itemset   sup
{2 3 5}   2
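The tables for database D can be reproduced with a short level-wise loop. This sketch assumes a minimum support count of 2, which the slide does not state but which is consistent with the itemsets dropped between each C and L table; the function name `apriori` and its signature are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # C1: every single item that occurs in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        size = len(next(iter(candidates)))
        # Phase 1: scan the database and count each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Phase 2: join frequent itemsets that differ in one item, then
        # prune candidates that have an infrequent subset.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(s) in level for s in combinations(union, size)
            ):
                candidates.add(union)
    return frequent

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
result = apriori(D.values(), min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The printed itemsets are the union of L1, L2, and L3 from the tables, with the same support counts.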


Apriori with Constraints

• If we have constraints, e.g., Sum(price) of the frequent group should not exceed X

• Lazy Approach
  • Apply the constraints at the end on the discovered patterns

• Eager Approach
  • Push the constraints during the computations (if possible)


Lazy Algorithm: Apriori + Constraint

Constraint: Sum(S.price) < 5

Assume price = ItemID

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)

itemset
{2 3 5}

Scan D → L3

itemset   sup
{2 3 5}   2


Eager Algorithm (Crossed entries are not computed)

Constraint: Sum(S.price) < 5

Assume price = ItemID

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)

itemset
{2 3 5}

Scan D → L3

itemset   sup
{2 3 5}   2
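The difference between the two approaches can be sketched on database D with the constraint Sum(S.price) < 5 and price = ItemID. The sketch assumes a minimum support count of 2 (implied by the tables, not stated) and relies on the constraint being anti-monotone: with positive prices, once an itemset violates it, every superset does too, so dropping violators early loses nothing. The names `mine`, `keep`, and `price_ok` are hypothetical.

```python
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_COUNT = 2  # assumed minimum support count
MAX_SUM = 5    # constraint: sum of prices must be < 5

def price_ok(itemset):
    # price = item ID, as assumed on the slide
    return sum(itemset) < MAX_SUM

def mine(transactions, min_count, keep=lambda itemset: True):
    """Level-wise mining; `keep` filters candidates before they are counted.

    Returns (frequent itemsets with counts, number of candidates counted).
    """
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, counted = {}, 0
    while candidates:
        size = len(next(iter(candidates)))
        candidates = {c for c in candidates if keep(c)}
        counted += len(candidates)
        level = {}
        for c in candidates:
            n = sum(c <= t for t in transactions)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == size + 1
            and all(frozenset(s) in level for s in combinations(a | b, size))
        }
    return frequent, counted

# Lazy: mine everything, then apply the constraint to the discovered patterns.
all_frequent, counted_lazy = mine(D, MIN_COUNT)
lazy = {s: n for s, n in all_frequent.items() if price_ok(s)}

# Eager: push the constraint into candidate filtering.
eager, counted_eager = mine(D, MIN_COUNT, keep=price_ok)

print(lazy == eager)                # True
print(counted_lazy, counted_eager)  # 12 6
```

Both approaches return {1}, {2}, {3}, and {1 3}; the eager version counts half as many candidates on this database.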


Apriori Adv/Disadv

• Advantages:
  • Uses large itemset property.
  • Easily parallelized.
  • Easy to implement.

• Disadvantages:
  • Assumes transaction database is memory resident.
  • Requires up to m database scans.


Association Rule Mining


Association Rules Outline

• Finding associations between the objects in the database

• When X happens, Y also happens with probability …
  • If the probability is high, then the association between X and Y is important

• What is the probability that, when a customer buys bread in a transaction, (s)he also buys milk in the same transaction?


Example: Market Basket Data

• Items frequently purchased together: Bread ⇒ PeanutButter
• Uses:
  • Placement
  • Advertising
  • Sales
  • Coupons
• Objective: increase sales and reduce costs


Association Rule Definitions

• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: Percentage of transactions which contain that itemset.
• Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%


Association Rule Definitions

• Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y
• Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
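Written as formulas over the transaction database D, the two definitions read as follows. The set-builder notation is an added restatement of the bullet definitions, not notation taken from the slides.

```latex
\[
\mathrm{support}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in D : X \subseteq t \,\}\right|}
  = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}.
\]
```

The last equality holds because both supports share the denominator |D|, which cancels.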


Association Rules Ex (cont’d)


Association Rule Problem

• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• Link Analysis
• NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.


Association Rule Techniques

1. Find Large Itemsets.
2. Generate rules from frequent itemsets.


Example

Rule: Bread → PeanutButter
• Support of rule = support(Bread, PeanutButter) = 60%
• Confidence of rule = support(Bread, PeanutButter) / support(Bread) = 75%

Rule: Bread, Jelly → PeanutButter
• Support of rule = support(Bread, Jelly, PeanutButter) = 20%
• Confidence of rule = support(Bread, Jelly, PeanutButter) / support(Bread, Jelly) = 100%

Usually we search for rules with: Support > α, Confidence > β
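Generating rules from an itemset and filtering by confidence can be sketched as below. The transaction table is again an assumption (it is not part of the text) chosen to reproduce the 60% / 75% and 20% / 100% figures above; `count`, `rules_from`, and the β value are illustrative. Confidence is computed from raw counts rather than from rounded support fractions to avoid floating-point drift.

```python
from itertools import combinations

# Assumed transaction table (not in the text); consistent with the
# support and confidence values quoted in the example.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def rules_from(itemset, beta):
    """Yield (X, Y, support, confidence) for each split X => Y with confidence > beta.

    `itemset` is expected to occur at least once, so count(X) is never zero.
    """
    itemset = frozenset(itemset)
    n_both = count(itemset)
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            confidence = n_both / count(lhs)
            if confidence > beta:
                yield lhs, itemset - lhs, n_both / len(transactions), confidence

for lhs, rhs, s, c in rules_from({"Bread", "PeanutButter"}, beta=0.7):
    print(sorted(lhs), "=>", sorted(rhs), f"support={s:.0%}", f"confidence={c:.0%}")
```

In a full pipeline, `rules_from` would be applied to each frequent itemset found in step 1, so the support threshold α is already enforced before rules are generated.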