DATA MINING - WPIweb.cs.wpi.edu/~cs561/s14/Lectures/W3/DataMining-1.pdf · What Is Data Mining? •...

DATA MINING

Introduction

What Is Data Mining?

• Data mining (knowledge discovery from data)
  •  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
  •  Knowledge discovery (mining) in databases (KDD)
  •  Knowledge extraction
  •  Data/pattern analysis
  •  Information harvesting
  •  Business intelligence

Usage of Data Mining: Real-World Apps I
• Play-by-play information recorded by teams
  •  Who is on the court
  •  Who shoots
  •  Results
• Coaches want to know what works best
  •  Plays that work well against a given team
  •  Good/bad player matchups

Knowledge

Usage of Data Mining: Real-World Apps I

• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

[Figure: bar chart of shooting percentage (scale 0 to 60); bars: "Overall" and "Starks + Houston + Ward playing"]

Usage of Data Mining: Real-World Apps 2
• Assume players X and Y
• All statistics from previous matches are available
• What we need to know is
  •  What factors each player should focus on to win the game
  •  What the weaknesses of the other side are
  •  ….

Usage of Data Mining: Real-World Apps 2
•  IBM has data mining tools for analyzing tennis data and converting it into knowledge

Usage of Data Mining: Real-World Apps 3
• Item and customer transactions
  •  Which items are bought together
  •  When the items are bought
  •  Quantities
• Owners need to know
  •  Are there peak seasons for specific items?
  •  Which items should be put next to each other?
  •  If we discount item X, should we also discount other items?

Knowledge

Usage of Data Mining: Real-World Apps 3

[Figure callouts: "One piece of knowledge", "Another piece of knowledge"]

If you put a discount on A and B, do not put one on C

Data Mining: Name

•  The process of discovering meaningful new correlations, patterns, and trends from large amounts of stored data.

Names in use:
• Data Mining
• Knowledge Mining
• Knowledge Discovery in Databases
• Data Archaeology
• Data Dredging
• Database Mining
• Knowledge Extraction
• Data Pattern Processing
• Information Harvesting
• Siftware

Integration of Multiple Technologies

Data Mining sits at the center of:
• Machine Learning
• Database Management
• Artificial Intelligence
• Statistics
• Visualization
• Algorithms

Data Mining: Classification Schemes
• General functionality
  • Descriptive data mining
  • Predictive data mining
• Different views, different classifications
  • Kinds of data to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

Knowledge Discovery in Databases: Process

Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge

adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

What Can Data Mining Do?
• Cluster
• Classify
  • Categorical, Regression
• Summarize
  • Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  • Association rules
• Sequence analysis
  • Time-series analysis, Sequential associations
• Detect Deviations (Outliers)

Clustering
•  Find groups of similar data items
•  Statistical techniques require some definition of “distance” (e.g. between travel profiles), while conceptual techniques use background concepts and logical descriptions

Uses:
•  Demographic analysis

Technologies:
•  Self-Organizing Maps
•  Probability Densities
•  Conceptual Clustering

“Group people with similar travel profiles”

Clusters:
• George, Patricia
• Jeff, Evelyn, Chris
• Rob

Classification
•  Find ways to separate data items into pre-defined groups
  •  We know X and Y belong together; find other things in the same group
•  Requires “training data”: data items where the group is known

Uses:
•  Profiling

Technologies:
•  Generate decision trees (results are human understandable)
•  Neural Nets

“Route documents to most likely interested parties”
• English or non-English?
• Domestic or foreign?

Training Data → [tool produces] → classifier → Groups

Association Rules
•  Identify dependencies in the data:
  •  X makes Y likely
•  Indicate significance of each dependency
•  Bayesian methods

Uses:
•  Targeted marketing

Technologies:
•  AIS, SETM, Hugin, TETRAD II

“Find groups of items commonly purchased together”
•  People who purchase fish are extraordinarily likely to purchase wine
•  People who purchase turkey are extraordinarily likely to purchase cranberries

Date/Time/Register   Fish   Turkey   Cranberries   Wine   …
12/6 13:15 2         N      Y        Y             Y      …
12/6 13:16 3         Y      N        N             Y      …

Deviation Detection (Outlier Detection)
•  Find unexpected values, outliers

Uses:
•  Failure analysis
•  Anomaly discovery for analysis

Technologies:
•  Clustering/classification methods
•  Statistical techniques
•  Visualization

• “Find unusual occurrences in IBM stock prices”

Date       Close    Volume   Spread
58/07/02   369.50   314.08   .022561
58/07/03   369.25   313.87   .022561
58/07/04   Market Closed
58/07/07   370.00   314.50   .022561

Sample date   Event             Occurrences
58/07/04      Market closed     317 times
59/01/06      2.5% dividend     2 times
59/04/04      50% stock split   7 times
73/10/09      not traded        1 time
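As one illustration of the "statistical techniques" bullet, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is only one simple choice and not necessarily the method behind the IBM example; the function name `zscore_outliers`, the threshold of 2.0, and the price series are all assumptions made for the illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical closing prices; 410.00 is planted as the unusual value.
closes = [369.50, 369.25, 370.00, 369.75, 369.50, 410.00, 369.25, 370.00]
print(zscore_outliers(closes))
```

A single extreme value inflates the standard deviation itself, so on short series a z-score test can miss outliers; robust variants (median-based) are a common alternative.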

Architecture: Typical Data Mining System

Layers, top to bottom:
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration; filtering
• Databases; Data Warehouse

Side component:
• Knowledge-base

What To Cover

• Frequent Itemset Mining

• Association Rule Mining

• Clustering

• Classification

• Deviation (Outlier) Detection

Frequent Itemset Mining

•  Very common problem in Market-Basket applications

•  Given a set of items I = {milk, bread, jelly, …}
•  Given a set of transactions where each transaction contains a subset of the items
  •  t1 = {milk, bread, water}
  •  t2 = {milk, nuts, butter, rice}

Frequent Pattern Mining
•  Given a set of items I = {milk, bread, jelly, …}
•  Given a set of transactions where each transaction contains a subset of the items
  •  t1 = {milk, bread, water}
  •  t2 = {milk, nuts, butter, rice}

Which itemsets are frequently sold together?

An itemset is frequent if the % of transactions in which the itemset appears >= α

Example

Assume α = 60%. What are the frequent itemsets?
•  {Bread} → 80%
•  {PeanutButter} → 60%
•  {Bread, PeanutButter} → 60%

The percentage attached to each itemset is called its “Support”.
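A minimal sketch of the support computation follows. The five baskets in `TRANSACTIONS` are hypothetical: the transaction table itself is not reproduced in the text, so they were chosen only to reproduce the three percentages above. The function names `support` and `is_frequent` are likewise illustrative.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical baskets, chosen only so that the supports match the slide.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def is_frequent(itemset: Iterable[str], transactions: List[FrozenSet[str]],
                alpha: float) -> bool:
    """An itemset is frequent when its support is at least alpha."""
    return support(itemset, transactions) >= alpha


for candidate in (["Bread"], ["PeanutButter"], ["Bread", "PeanutButter"], ["Beer"]):
    print(candidate, support(candidate, TRANSACTIONS),
          is_frequent(candidate, TRANSACTIONS, 0.6))
```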

How to find frequent itemsets

• Naïve Approach
  •  Enumerate all possible itemsets and then count each one

All possible itemsets of size 1

Can we optimize??
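To see why optimization is needed, the naïve approach can be sketched as below. With m distinct items there are 2^m − 1 non-empty itemsets, and each one costs a pass over the transactions. The baskets are hypothetical (chosen to match the Bread/PeanutButter percentages), and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical baskets, chosen to match the Bread/PeanutButter example.
TRANSACTIONS = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]


def all_itemsets(items):
    """Yield every non-empty subset of `items` (2**m - 1 of them)."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def naive_frequent_itemsets(transactions, alpha):
    """Count every possible itemset; keep those with support >= alpha."""
    items = set().union(*transactions)
    n = len(transactions)
    frequent = {}
    for candidate in all_itemsets(items):
        count = sum(1 for t in transactions if candidate <= t)
        if count / n >= alpha:
            frequent[candidate] = count / n
    return frequent


print(len(list(all_itemsets(set().union(*TRANSACTIONS)))))  # candidates examined
print(naive_frequent_itemsets(TRANSACTIONS, 0.6))
```

Even for these five items the loop examines 31 candidates to find three frequent itemsets; the count doubles with every additional item.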


Property

For an itemset S = {X, Y, Z, …} of size n to be frequent, all its subsets of size n-1 must be frequent as well.
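The property follows from a short containment argument, sketched here with D denoting the set of transactions and S' any subset of S (the symbols D and S' are notation introduced for this sketch):

```latex
S' \subseteq S
\;\Longrightarrow\;
\{\, t \in D : S \subseteq t \,\} \subseteq \{\, t \in D : S' \subseteq t \,\}
\;\Longrightarrow\;
\mathrm{support}(S') \ge \mathrm{support}(S)
```

Every transaction that contains S also contains S', so a subset can never be rarer than its superset. Read in the other direction: if any subset has support below α, S cannot be frequent and need not be counted.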

Apriori Algorithm
•  Executes in scans (iterations); each scan has two phases
  •  Given a list of candidate itemsets of size n, count their appearances and find the frequent ones
  •  From the frequent ones, generate candidates of size n+1 (the previous property must hold)
    •  All subsets of size n must be frequent for an itemset to be a candidate
•  Start the algorithm with n = 1, then repeat

Use the property to reduce the number of itemsets to check
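The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an optimized implementation; the function name `apriori`, the use of an absolute support count (`min_count`) instead of a percentage, and the join strategy are assumptions of the sketch. `DATABASE_D` holds the four transactions used in Example 2.

```python
from itertools import combinations

# The four transactions of Example 2 (TIDs 100, 200, 300, 400).
DATABASE_D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def apriori(transactions, min_count):
    """Return {itemset: support count} for itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Candidates of size 1: every distinct item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        # Phase 1: one scan of the data to count each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Phase 2: join frequent itemsets into candidates one item larger,
        # keeping only those whose subsets of the current size are all frequent.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


for itemset, count in sorted(apriori(DATABASE_D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum count of 2, the only size-3 candidate that survives the subset check is {2, 3, 5}; {1, 2, 3} and {1, 3, 5} are discarded without being counted because {1, 2} and {1, 5} are infrequent.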

Apriori Example

Apriori Example (Cont’d)

The Apriori Algorithm — Example 2

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3

itemset
{2 3 5}

Scan D → L3

itemset   sup
{2 3 5}   2

Apriori with Constraints
•  If we have constraints, e.g., Sum(price) of the frequent group should not exceed X
• Lazy Approach
  •  Apply the constraints at the end on the discovered patterns
• Eager Approach
  •  Push the constraints into the computation (if possible)

Lazy Algorithm: Apriori + Constraint

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Constraint: Sum(S.price) < 5

Assume price = ItemID
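The lazy approach amounts to a filter applied after mining has finished. In the sketch below, `FREQUENT` hard-codes the frequent itemsets and support counts found in Example 2; the names `item_price`, `satisfies`, and `lazy_filter` are illustrative.

```python
# Frequent itemsets and support counts as found in Example 2.
FREQUENT = {
    frozenset({1}): 2,
    frozenset({2}): 3,
    frozenset({3}): 3,
    frozenset({5}): 3,
    frozenset({1, 3}): 2,
    frozenset({2, 3}): 2,
    frozenset({2, 5}): 3,
    frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}


def item_price(item):
    """The slide's assumption: price = ItemID."""
    return item


def satisfies(itemset, limit=5):
    """Constraint Sum(S.price) < limit."""
    return sum(item_price(i) for i in itemset) < limit


def lazy_filter(frequent, limit=5):
    """Apply the constraint only after all frequent itemsets are known."""
    return {s: n for s, n in frequent.items() if satisfies(s, limit)}


for itemset, count in sorted(lazy_filter(FREQUENT).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Only {1}, {2}, {3} and {1, 3} pass the filter, so the counting work spent on the other five itemsets was wasted.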

Eager Algorithm (Crossed entries are not computed)

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Constraint: Sum(S.price) < 5

Assume price = ItemID
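A sketch of the eager approach follows, under the assumption that prices are non-negative. In that case Sum(S.price) < limit behaves like the frequency property: once an itemset violates it, every superset violates it too, so violating candidates can be dropped before they are counted. The function name `eager_apriori` and the `counted` statistic are additions for illustration.

```python
from itertools import combinations

# The four transactions of the example database D.
DATABASE_D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def item_price(item):
    """The slide's assumption: price = ItemID."""
    return item


def eager_apriori(transactions, min_count, limit):
    """Apriori with the constraint Sum(S.price) < limit pushed into each scan.

    Returns (frequent itemsets with counts, number of candidates counted).
    """
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, counted, size = {}, 0, 1
    while candidates:
        # Drop constraint violators BEFORE scanning the data.
        candidates = {c for c in candidates
                      if sum(item_price(i) for i in c) < limit}
        counted += len(candidates)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent, counted


result, counted = eager_apriori(DATABASE_D, 2, 5)
print(sorted(sorted(s) for s in result), counted)
```

On this database the eager version counts 6 candidates, whereas unconstrained Apriori counts 12 (5 + 6 + 1) before the lazy filter is applied; both end with the same four itemsets.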

Apriori Adv/Disadv
• Advantages:
  •  Uses large itemset property.
  •  Easily parallelized.
  •  Easy to implement.
• Disadvantages:
  •  Assumes transaction database is memory resident.
  •  Requires up to m database scans.

Association Rule Mining

Association Rules Outline

•  Finding associations between the objects in the database

• When X happens, Y also happens with probability …
  •  If the probability is high, then the association between X and Y is important
• What is the probability that, when a customer buys bread in a transaction, (s)he also buys milk in the same transaction?

Example: Market Basket Data
•  Items frequently purchased together: Bread ⇒ PeanutButter
• Uses:
  •  Placement
  •  Advertising
  •  Sales
  •  Coupons

• Objective: increase sales and reduce costs

Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: Percentage of transactions which contain that itemset.
• Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%

Association Rule Definitions
• Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅

• Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y

• Confidence of AR (α) X ⇒ Y: Ratio of number of transactions that contain X ∪ Y to the number that contain X
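Written as formulas, with D the set of transactions as defined earlier, the two measures read as follows (the function names support and confidence are used here purely as notation):

```latex
\mathrm{support}(X \Rightarrow Y)
  = \frac{|\{\, t \in D : X \cup Y \subseteq t \,\}|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{|\{\, t \in D : X \cup Y \subseteq t \,\}|}{|\{\, t \in D : X \subseteq t \,\}|}
  = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
```

Confidence can therefore be read as an estimate of the conditional probability of seeing Y in a transaction given that X is already in it.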

Association Rules Ex (cont’d)

Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• Link Analysis
• NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.

Association Rule Techniques
1.  Find Large Itemsets.
2.  Generate rules from frequent itemsets.

Example

Rule: Bread → PeanutButter
•  Support of rule = support(Bread, PeanutButter) = 60%
•  Confidence of rule = support(Bread, PeanutButter) / support(Bread) = 75%

Rule: Bread, Jelly → PeanutButter
•  Support of rule = support(Bread, Jelly, PeanutButter) = 20%
•  Confidence of rule = support(Bread, Jelly, PeanutButter) / support(Bread, Jelly) = 100%

Usually we search for rules with Support > α and Confidence > β
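The rule-generation step can be sketched as follows: take one frequent itemset, try every split into a left-hand and right-hand side, and keep the splits whose confidence reaches the threshold. The baskets are hypothetical (chosen to reproduce the 60%, 75%, 20% and 100% figures above), the names `count` and `rules_from_itemset` are illustrative, and the sketch uses >= for the confidence threshold. Finding the frequent itemsets themselves is assumed to have been done already.

```python
from itertools import combinations

# Hypothetical baskets, chosen to reproduce the percentages in the example.
TRANSACTIONS = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]


def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def rules_from_itemset(itemset, transactions, min_conf):
    """Split one itemset into rules (lhs, rhs, confidence) with confidence >= min_conf."""
    itemset = frozenset(itemset)
    both = count(itemset, transactions)
    rules = []
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs_count = count(lhs, transactions)
            if lhs_count == 0:
                continue  # lhs never occurs, confidence is undefined
            conf = both / lhs_count  # count ratio equals the support ratio
            if conf >= min_conf:
                rules.append((frozenset(lhs), itemset - frozenset(lhs), conf))
    return rules


for lhs, rhs, conf in rules_from_itemset({"Bread", "PeanutButter"}, TRANSACTIONS, 0.7):
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Working with counts rather than percentages avoids dividing two rounded floating-point supports; the ratio is the same because both supports share the denominator |D|.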
