Introduction to Data Mining: Chapter 4

6107 Ch4 V2


Page 1: 6107 Ch4 V2

Introduction to Data Mining

Chapter 4

Page 2: 6107 Ch4 V2

Chapter 4 Outline

– Background
– Information is Power
– Knowledge is Power
– Data Mining

Page 3: 6107 Ch4 V2

Introduction

Page 4: 6107 Ch4 V2

Information is Power

Relevant
Right Information
Globalised world
Vast amount of information available

Page 5: 6107 Ch4 V2

What is Information?

A collection of data
The act of human analysis and interpretation of activities
Decomposing it into various components and tackling them

Page 6: 6107 Ch4 V2

What is Knowledge?

The act of human synthesis and evaluation of information
Integration of the relevant components to form a coherent whole system.

Page 7: 6107 Ch4 V2

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Page 8: 6107 Ch4 V2

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data

Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation

Page 9: 6107 Ch4 V2

Data Mining Definition I

The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data

The use of a variety of techniques, such as neural networks, decision trees, or standard statistical techniques, to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.

Page 10: 6107 Ch4 V2

Data Mining Definition II

Finding hidden information in a database

Page 11: 6107 Ch4 V2

Hidden Information

Years of experience
Great secret recipes
Success factors

Page 12: 6107 Ch4 V2

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

[Figure: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Page 13: 6107 Ch4 V2

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

Page 14: 6107 Ch4 V2

Database Processing vs. Data Mining Processing

Database processing

Query
– Well defined
– SQL
Data
– Operational data
Output
– Precise
– Subset of database

Data mining processing

Query
– Poorly defined
– No precise query language
Data
– Not operational data
Output
– Fuzzy
– Not a subset of database

Page 15: 6107 Ch4 V2

Query Examples

Database
– Find all credit applicants with the surname Lee.
– Identify customers who have purchased more than $100,000 in the last year.
– Find all customers who have purchased bread.

Data Mining
– Find all credit applicants who are good credit risks. (classification)
– Identify customers with similar eating habits. (clustering)
– Find all items which are frequently purchased with bread. (association rules)
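The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the `purchases` records and customer names are hypothetical, and a simple co-occurrence count stands in for a real association-rule algorithm.

```python
from collections import Counter

# Hypothetical purchase records: (customer, basket of items).
purchases = [
    ("alice", {"bread", "butter", "milk"}),
    ("bob", {"bread", "butter"}),
    ("carol", {"milk", "cola"}),
    ("dave", {"bread", "jam", "butter"}),
]

# Database-style query: well defined, and the answer is a subset of the data.
bread_buyers = [customer for customer, basket in purchases if "bread" in basket]

# Mining-style question: which items tend to appear together with bread?
# The answer is a summary (counts), not a subset of the stored records.
with_bread = Counter()
for _, basket in purchases:
    if "bread" in basket:
        with_bread.update(basket - {"bread"})

print(bread_buyers)               # ['alice', 'bob', 'dave']
print(with_bread.most_common(1))  # [('butter', 3)]
```

The first result could be produced by a single SQL `SELECT`; the second needs a pass over the data that aggregates across records.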

Page 16: 6107 Ch4 V2

Data Mining Models and Tasks

Page 17: 6107 Ch4 V2

Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
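The train/test procedure described above can be sketched as follows. Everything here is an assumption made for illustration: the `income` records are made up, the 30% test fraction is arbitrary, and the "learning algorithm" is a deliberately trivial majority-class predictor rather than any method from the slides.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, labelled_records):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(1 for attrs, label in labelled_records if model(attrs) == label)
    return correct / len(labelled_records)

def learn_majority_class(train):
    """Toy learning algorithm: always predict the most common training class."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda attrs: majority

# Hypothetical records: (attributes, class label).
data = [({"income": i}, "Yes" if i > 80 else "No") for i in range(50, 150, 10)]

train, test = split_train_test(data)   # build the model on train only
model = learn_majority_class(train)
print(f"test accuracy: {accuracy(model, test):.2f}")  # validate on unseen records
```

The key design point is that `accuracy` is computed on records the model never saw during learning, which is what makes it an estimate of performance on previously unseen data.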

Page 18: 6107 Ch4 V2

Illustrating Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction)]

Page 19: 6107 Ch4 V2

Examples of Classification Task

Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

Page 20: 6107 Ch4 V2

Classification Techniques

Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

Page 21: 6107 Ch4 V2

Example of a Decision Tree

Training Data

Tid  Refund         Marital Status  Taxable Income  Cheat
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
– Yes: NO
– No: MarSt?
    – Married: NO
    – Single, Divorced: TaxInc?
        – < 80K: NO
        – > 80K: YES
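As a quick check, the tree above can be written as a small Python function and run against the ten training records. This is a sketch, not the slides' own code; the function name `predict_cheat` and the handling of an income of exactly 80K (which the slide leaves undefined) are assumptions.

```python
# Training data from the slide: (refund, marital_status, taxable_income_k, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def predict_cheat(refund, marital_status, income_k):
    """Walk the tree: test Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: the slide only labels "< 80K" and "> 80K"; an income of
    # exactly 80K is treated here as not above the threshold.
    return "Yes" if income_k > 80 else "No"

mistakes = [row for row in TRAINING if predict_cheat(*row[:3]) != row[3]]
print(f"{len(mistakes)} of {len(TRAINING)} training records misclassified")
```

Running it reports 0 misclassified records, i.e. the tree fits the training data exactly.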

Page 22: 6107 Ch4 V2

Another Example of Decision Tree

Training data: the same ten records (Tid 1 to 10) as in the previous example, with Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Cheat (class).

MarSt?
– Married: NO
– Single, Divorced: Refund?
    – Yes: NO
    – No: TaxInc?
        – < 80K: NO
        – > 80K: YES

There could be more than one tree that fits the same data!

Page 23: 6107 Ch4 V2

Decision Tree Classification Task

[Figure: the training set (Tid 1 to 10) and test set (Tid 11 to 15) from "Illustrating Classification Task"; a tree induction algorithm learns a decision tree from the training set (induction), and the tree is then applied to the test set (deduction)]

Page 24: 6107 Ch4 V2

Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Decision tree

Refund?
– Yes: NO
– No: MarSt?
    – Married: NO
    – Single, Divorced: TaxInc?
        – < 80K: NO
        – > 80K: YES

Start from the root of the tree.


Page 29: 6107 Ch4 V2

Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Following the tree from the root: Refund = No leads to the MarSt test, and Marital Status = Married leads to the leaf NO.

Assign Cheat to “No”
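The walk from the root to a leaf can also be sketched generically, with the tree stored as data and a loop that records each test taken. The nested-tuple representation and the helper names `branch_for` and `classify` are illustrative choices, not something the slides prescribe.

```python
# The tree from the slides as nested data. Leaves are class labels (strings);
# internal nodes are (attribute, {branch_label: subtree}) pairs.
TREE = ("Refund", {
    "Yes": "No",
    "No": ("MarSt", {
        "Married": "No",
        "Single, Divorced": ("TaxInc", {"< 80K": "No", "> 80K": "Yes"}),
    }),
})

def branch_for(attribute, record):
    """Map a record's raw value to the branch label used in the tree."""
    if attribute == "Refund":
        return record["Refund"]
    if attribute == "MarSt":
        return "Married" if record["MarSt"] == "Married" else "Single, Divorced"
    # Assumption: exactly 80K is not covered by the slide; treat it as "< 80K".
    return "> 80K" if record["TaxInc"] > 80 else "< 80K"

def classify(tree, record):
    """Return (label, path), where path lists the tests taken from the root."""
    path = []
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        branch = branch_for(attribute, record)
        path.append(f"{attribute} = {branch}")
        node = branches[branch]
    return node, path

label, path = classify(TREE, {"Refund": "No", "MarSt": "Married", "TaxInc": 80})
print(" -> ".join(path), "=> Cheat =", label)
# Refund = No -> MarSt = Married => Cheat = No
```

For this test record the income test is never reached, so the ambiguity at exactly 80K does not affect the result.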


Page 31: 6107 Ch4 V2

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
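The two quantities in the figure can be made concrete with a small calculation. The points and their cluster assignment below are hypothetical, and averaging pairwise Euclidean distances is just one of several reasonable ways to measure cluster quality.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Hypothetical 2-D points, already assigned to two clusters.
clusters = {
    "a": [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4)],
    "b": [(5.0, 5.0), (5.5, 4.8), (4.9, 5.3)],
}

def mean_intra_cluster_distance(clusters):
    """Average distance between points that share a cluster."""
    pairs = [pair for pts in clusters.values() for pair in combinations(pts, 2)]
    return mean(dist(p, q) for p, q in pairs)

def mean_inter_cluster_distance(clusters):
    """Average distance between points that lie in different clusters."""
    pairs = [
        pair
        for pts_a, pts_b in combinations(clusters.values(), 2)
        for pair in product(pts_a, pts_b)
    ]
    return mean(dist(p, q) for p, q in pairs)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A good clustering in the sense of the slide is one where the first number is small relative to the second.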

Page 32: 6107 Ch4 V2

Applications of Cluster Analysis

Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Summarization
– Reduce the size of large data sets

Page 33: 6107 Ch4 V2

What is not Cluster Analysis?

Supervised classification
– Have class label information

Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

Results of a query
– Groupings are a result of an external specification

Graph partitioning
– Some mutual relevance and synergy, but areas are not identical

Page 34: 6107 Ch4 V2

Notion of a Cluster can be Ambiguous

How many clusters?

[Figure: the same set of points grouped as two clusters, four clusters, or six clusters]

Page 35: 6107 Ch4 V2

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Page 36: 6107 Ch4 V2

Partitional Clustering

[Figure: the original points, and a partitional clustering of them]

Page 37: 6107 Ch4 V2

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1 to p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram]
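One way nested clusters of this kind can be produced is bottom-up (agglomerative) merging. The sketch below uses single-link distance; the slide names neither an algorithm nor coordinates, so both the method and the positions of p1 to p4 are assumptions made for illustration.

```python
from math import dist

# Hypothetical coordinates for the four points named on the slide.
points = {"p1": (0.0, 0.0), "p2": (0.0, 1.0), "p3": (2.5, 0.5), "p4": (6.0, 0.5)}

def single_link(a, b):
    """Distance between two clusters = distance of their closest members."""
    return min(dist(points[x], points[y]) for x in a for y in b)

def agglomerate(names):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = min(pairs, key=lambda pair: single_link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b)))
    return merges

merges = agglomerate(points)
for left, right in merges:
    print(left, "+", right)
```

Each printed merge corresponds to one internal node of a dendrogram: read top to bottom, the output lists the nested clusters from the smallest to the one containing all points.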

Page 38: 6107 Ch4 V2

Association Rules

Association rules are a data mining technique and complement market basket analysis.

All association rules are unidirectional and take the following form:
Left-hand side IMPLIES right-hand side

Both the left-hand side and the right-hand side of the rule may contain multiple items or combinations of items, such as the following:
Yellow Peppers IMPLIES Red Peppers, Bananas, and Bakery

Associations are written as A → B, where A is called the antecedent or left-hand side (LHS) and B is called the consequent or right-hand side (RHS).
– Ex: “If people buy a printer then they buy a cartridge”
    » The antecedent is “buy printer” and the consequent is “buy cartridge”

Page 39: 6107 Ch4 V2

Association Rules

Market Basket Analysis
– Necessary to have a list of transactions and what was purchased in each one.
– Ex:
    Transaction 1: Frozen pizza, cola, milk
    Transaction 2: Milk, potato chips
    Transaction 3: Cola, frozen pizza
    Transaction 4: Milk, pretzels
    Transaction 5: Cola, pretzels

Page 40: 6107 Ch4 V2

Association Rules

              Frozen Pizza  Milk  Cola  Potato Chips  Pretzels
Frozen Pizza  2             1     2     0             0
Milk          1             3     1     1             1
Cola          2             1     3     0             1
Potato Chips  0             1     0     1             0
Pretzels      0             1     1     0             2
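The table above is a co-occurrence count over the five example transactions. A minimal sketch of how it could be computed follows; the function name `co_occurrence` is an illustrative choice.

```python
from itertools import product

ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# The five example transactions from the market basket slide.
TRANSACTIONS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def co_occurrence(transactions, items):
    """counts[a][b] = number of baskets that contain both a and b."""
    counts = {a: {b: 0 for b in items} for a in items}
    for basket in transactions:
        # Pairing every item with every item (including itself) makes the
        # diagonal count how many baskets contain that single item.
        for a, b in product(basket, repeat=2):
            counts[a][b] += 1
    return counts

table = co_occurrence(TRANSACTIONS, ITEMS)
for item in ITEMS:
    print(f"{item:<13}", [table[item][other] for other in ITEMS])
```

The printed rows match the table, and the matrix is symmetric because co-occurrence has no direction.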

Page 41: 6107 Ch4 V2

Association Rules

Measures of Association
– Support
    » The support measure refers to the percentage of baskets in the analysis where the rule is true, that is, where both the left-hand side and the right-hand side of the association are found.
– Confidence
    » The percentage of baskets from the analysis having the left-hand side item that also contain the right-hand side item is found via the confidence measure. This measure is different from support in that confidence is the probability that the right-hand side item is present given that we know the left-hand side item is in the basket.
    » Calculated as a ratio: (frequency of A and B) / (frequency of A)

Page 42: 6107 Ch4 V2

Association Rules

Measures of Association
– The support measure
    • for the rule
        “Cola IMPLIES Frozen Pizza” is 40%
        “Frozen Pizza IMPLIES Cola” is 40%
    • for the single item
        “Milk” is 60%

(Note: support considers only the combination and not the direction.)

Page 43: 6107 Ch4 V2

Association Rules

Measures of Association
– Confidence
    “Milk IMPLIES Potato Chips” has confidence:
    = (frequency of A and B) / (frequency of A)
    = 20% / 60%
    = 33%
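The support and confidence figures quoted for the five example transactions can be reproduced with a short sketch. The function names `support` and `confidence` are illustrative; the formulas follow the definitions given on the slides.

```python
# The five example transactions from the market basket slide.
TRANSACTIONS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(items, transactions):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """(frequency of A and B) / (frequency of A) for the rule A IMPLIES B."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Cola", "Frozen Pizza"}, TRANSACTIONS))   # 0.4
print(support({"Milk"}, TRANSACTIONS))                   # 0.6
print(round(confidence({"Milk"}, {"Potato Chips"}, TRANSACTIONS), 2))  # 0.33
```

Because `support` takes a set, it gives the same value for "Cola IMPLIES Frozen Pizza" and "Frozen Pizza IMPLIES Cola", whereas `confidence` depends on which side is the antecedent.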

Page 44: 6107 Ch4 V2

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

Page 45: 6107 Ch4 V2

KDD Process

Selection (Pre-Mining 1): Obtain data from various sources.
Preprocessing (Pre-Mining 2): Cleanse data.
Transformation (Pre-Mining 3): Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner.

Modified from [FPSS96C]

Page 46: 6107 Ch4 V2

KDD Process Ex: Web Log

Selection:
– Select log data (dates and locations) to use
Preprocessing:
– Remove identifying URLs
– Remove error logs
Transformation:
– Sessionize (sort and group)
Data Mining:
– Identify and count patterns
– Construct data structure
Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
Potential User Applications:
– Cache prediction
– Personalisation
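The web log steps above can be sketched as a small pipeline. The log format, the field order, and the simplification of one session per user are all assumptions made for illustration; a real sessionizer would typically also split sessions on time gaps.

```python
from collections import Counter, defaultdict

# Hypothetical selected log entries: (user, timestamp, url, http_status).
RAW_LOG = [
    ("u1", 1, "/home", 200), ("u1", 2, "/products", 200), ("u1", 3, "/cart", 200),
    ("u2", 1, "/home", 200), ("u2", 2, "/missing", 404), ("u2", 3, "/products", 200),
    ("u3", 5, "/home", 200), ("u3", 6, "/products", 200),
]

def preprocess(log):
    """Preprocessing: remove error entries (HTTP status 400 and above)."""
    return [entry for entry in log if entry[3] < 400]

def sessionize(log):
    """Transformation: sort by user and time, then group page visits per user."""
    sessions = defaultdict(list)
    for user, _, url, _ in sorted(log, key=lambda e: (e[0], e[1])):
        sessions[user].append(url)
    return dict(sessions)

def count_page_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    return counts

pairs = count_page_pairs(sessionize(preprocess(RAW_LOG)))

# Interpretation/evaluation: display the most frequently accessed sequence.
print(pairs.most_common(1))  # [(('/home', '/products'), 3)]
```

A result like this is what would feed the applications listed above, for example prefetching `/products` for a user who has just loaded `/home`.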

Page 47: 6107 Ch4 V2

Data Mining Development

• Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines

• Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis

• Neural Networks • Decision Tree Algorithms

• Algorithm Design Techniques • Algorithm Analysis • Data Structures

• Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques

Page 48: 6107 Ch4 V2

Data mining: What it can’t do

– tell the value of the patterns to the organization
– replace skilled business analysts or managers
– automatically discover solutions without guidance