6107 Ch4 V2

1

Introduction to Data Mining

Chapter 4

2

Chapter 4 Outline
– Background
– Information is Power
– Knowledge is Power
– Data Mining

3

Introduction

4

Information is Power

– Relevant
– Right information
– Globalised world
– Vast amount of information available

5

What is Information?

– A collection of data
– The act of human analysis and interpretation of activities
– Decomposing it into various components and tackling them

6

What is Knowledge?

– The act of human synthesis and evaluation of information
– Integration of the relevant components to form a relevant whole system

7

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

8

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data

Data mining may help scientists
– In classifying and segmenting data
– In hypothesis formation

9

Data Mining Definition I

The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data

The use of a variety of techniques, such as neural networks, decision trees, or standard statistical techniques, to identify nuggets of information or decision-making knowledge in bodies of data, and to extract them in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.

10

Data Mining Definition II

Finding hidden information in a database

11

Hidden Information

– Number of years of experience
– Great secret recipes
– Success factors

12

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.]

13

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)

– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

14

Database Processing vs. Data Mining Processing

Database processing
– Query: well defined; expressed in SQL
– Data: operational data
– Output: precise; a subset of the database

Data mining processing
– Query: poorly defined; no precise query language
– Data: not operational data
– Output: fuzzy; not a subset of the database

15

Query Examples

Database
– Find all credit applicants with the surname Lee.
– Identify customers who have purchased more than $100,000 in the last year.
– Find all customers who have purchased bread.

Data Mining
– Find all credit applicants who are good credit risks. (classification)
– Identify customers with similar eating habits. (clustering)
– Find all items which are frequently purchased with bread. (association rules)
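The contrast between the two kinds of query can be sketched in Python. The purchase data and variable names below are hypothetical and chosen only for illustration. The first result is an exact subset of the stored records; the second is a derived pattern that is not stored anywhere in the data.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought (illustrative data only).
purchases = {
    "ann": {"bread", "butter", "milk"},
    "bob": {"bread", "butter"},
    "cat": {"milk", "cola"},
    "dan": {"bread", "jam", "butter"},
}

# Database-style query: well defined, and the answer is a subset of the data.
bread_buyers = sorted(c for c, items in purchases.items() if "bread" in items)

# Mining-style query: count what co-occurs with bread. The answer is a
# derived pattern, not a stored record.
with_bread = Counter()
for items in purchases.values():
    if "bread" in items:
        with_bread.update(items - {"bread"})

print(bread_buyers)
print(with_bread.most_common(1))
```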

16

Data Mining Models and Tasks

17

Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
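A minimal sketch of the train/test idea, assuming a 7/3 split and a deliberately trivial stand-in model (always predict the majority class of the training set). The ten records are the labelled examples used in this chapter; the split point and the `predict` helper are assumptions made for illustration, not a recommended classifier.

```python
from collections import Counter

# Ten labelled records: (attrib1, attrib2, attrib3 in K, class).
records = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]

# Assumed split: the first 7 records build the model, the last 3 validate it.
train, test = records[:7], records[7:]

# Stand-in "model": the most common class in the training set.
majority = Counter(r[-1] for r in train).most_common(1)[0][0]

def predict(attributes):
    """Ignore the attributes and return the majority class (trivial baseline)."""
    return majority

# Accuracy is measured only on records the model has not seen.
correct = sum(predict(r[:-1]) == r[-1] for r in test)
accuracy = correct / len(test)
print(majority, accuracy)
```

A real learning algorithm would replace `predict`; the split and the accuracy calculation stay the same.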

18

Illustrating Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set (deduction).]

19

Examples of Classification Task

– Predicting tumor cells as benign or malignant
– Classifying credit card transactions as legitimate or fraudulent
– Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
– Categorizing news stories as finance, weather, entertainment, sports, etc.

20

Classification Techniques

– Decision tree based methods
– Rule-based methods
– Memory-based reasoning
– Neural networks
– Naïve Bayes and Bayesian belief networks
– Support vector machines

21

Example of a Decision Tree

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
– Yes: NO
– No: MarSt
   – Married: NO
   – Single, Divorced: TaxInc
      – < 80K: NO
      – > 80K: YES
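The tree can be written as a small Python function and checked against the ten training records. This is a sketch rather than the output of any particular induction algorithm. The function name is hypothetical, and the handling of an income of exactly 80K is an assumption, since the slide only labels the branches "< 80K" and "> 80K".

```python
# Training records from the slide: (refund, marital_status, taxable_income_k, cheat)
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def classify(refund, marital_status, income_k):
    """Walk the tree: Refund first, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K.
    # Assumption: exactly 80K is sent down the "> 80K" branch.
    return "No" if income_k < 80 else "Yes"

# Records the tree gets wrong (an empty list means the tree fits the data).
errors = [r for r in training if classify(*r[:3]) != r[3]]
print(len(errors))
```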

22

Another Example of Decision Tree

Training Data: the same ten records as in the previous example (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class).

Model: Decision Tree
MarSt
– Married: NO
– Single, Divorced: Refund
   – Yes: NO
   – No: TaxInc
      – < 80K: NO
      – > 80K: YES

There could be more than one tree that fits the same data!
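The point that several trees can fit the same data can be checked directly. In this sketch both trees are hand-coded as functions (the names `refund_first` and `marst_first` are hypothetical), and an income of exactly 80K is assumed to follow the "> 80K" branch. The two trees test the attributes in a different order but label every training record the same way.

```python
from itertools import product

def refund_first(refund, marital_status, income_k):
    """Tree with Refund at the root."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "No" if income_k < 80 else "Yes"

def marst_first(refund, marital_status, income_k):
    """Tree with MarSt at the root."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income_k < 80 else "Yes"

# Training records from the slide: (refund, marital_status, taxable_income_k, cheat)
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

fits_first = all(refund_first(*r[:3]) == r[3] for r in training)
fits_second = all(marst_first(*r[:3]) == r[3] for r in training)

# Compare the two trees on a small grid of attribute combinations.
grid = product(["Yes", "No"], ["Single", "Married", "Divorced"], [60, 79, 81, 120])
same_everywhere = all(refund_first(*x) == marst_first(*x) for x in grid)
print(fits_first, fits_second, same_everywhere)
```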

23

Decision Tree Classification Task

[Figure: the classification workflow shown earlier, with a tree induction algorithm as the learning algorithm and a decision tree as the model. The training set is fed to the tree induction algorithm, which learns the decision tree (induction); the decision tree is then applied to the test set (deduction).]

24

Apply Model to Test Data

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

[Figure: the decision tree with Refund at the root, followed by MarSt and TaxInc.]

Start from the root of the tree.

Follow the branches that match the test record: Refund = No leads to the MarSt node, and Marital Status = Married leads to the leaf NO.

Assign Cheat to “No”
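The traversal can be sketched with the tree stored as nested dictionaries, recording each branch taken. The data layout and the helper names `branch_for` and `classify` are assumptions made for illustration. An income of exactly 80K is assumed to follow the "> 80K" branch, although this record never reaches the TaxInc node.

```python
# Decision tree as nested dicts; strings are leaf labels.
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": "NO",
                "Single, Divorced": {
                    "attr": "TaxInc",
                    "branches": {"< 80K": "NO", "> 80K": "YES"},
                },
            },
        },
    },
}

def branch_for(attr, record):
    """Map a record's value for attr to the branch label used in the tree."""
    if attr == "Refund":
        return record["Refund"]
    if attr == "MarSt":
        return "Married" if record["MarSt"] == "Married" else "Single, Divorced"
    # TaxInc (assumption: exactly 80K goes to the "> 80K" branch).
    return "< 80K" if record["TaxInc"] < 80 else "> 80K"

def classify(node, record):
    """Start at the root and follow matching branches until a leaf is reached."""
    path = []
    while isinstance(node, dict):
        label = branch_for(node["attr"], record)
        path.append(f'{node["attr"]} = {label}')
        node = node["branches"][label]
    return node, path

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = classify(tree, test_record)
print(label, path)
```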


31

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
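A small numeric sketch of the two quantities, using hypothetical 2-D points. Average pairwise Euclidean distance is used here as one possible measure; other definitions of intra- and inter-cluster distance are equally valid.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points in two clusters (illustrative data only).
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Intra-cluster: average distance between points in the same cluster.
intra = mean(dist(p, q) for pts in clusters.values() for p, q in combinations(pts, 2))

# Inter-cluster: average distance between points in different clusters.
inter = mean(dist(p, q) for p in clusters["A"] for q in clusters["B"])

# A good clustering has small intra-cluster and large inter-cluster distances.
print(intra < inter)
```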

32

Applications of Cluster Analysis

Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Summarization
– Reduce the size of large data sets

33

What is not Cluster Analysis?

Supervised classification
– Have class label information

Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

Results of a query
– Groupings are a result of an external specification

Graph partitioning
– Some mutual relevance and synergy, but areas are not identical

34

Notion of a Cluster can be Ambiguous

How many clusters?

[Figure: the same set of points grouped as two clusters, four clusters, and six clusters.]

35

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
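A hierarchical clustering can be sketched with a simple agglomerative procedure. Single-link merging is swapped in here as one common choice, not a method named in the slides, and the 1-D coordinates for p1 to p4 are hypothetical. Each merge produces a cluster that contains earlier ones, which is the nesting a dendrogram draws. Stopping before the final merge leaves a partitional clustering.

```python
# Hypothetical 1-D points labelled p1..p4 (illustrative data only).
points = {"p1": 1.0, "p2": 1.5, "p3": 5.0, "p4": 9.0}

def single_link(a, b):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(points[x] - points[y]) for x in a for y in b)

clusters = [frozenset([name]) for name in points]
merges = []  # nested clusters, in the order they form
while len(clusters) > 1:
    # Find the closest pair of clusters and merge it.
    pairs = [(single_link(a, b), i, j)
             for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j]
    _, i, j = min(pairs)
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    merges.append(sorted(merged))

print(merges)
```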

36

Partitional Clustering

[Figure: the original points, and a partitional clustering of those points.]

37

Hierarchical Clustering

[Figure: four points p1 to p4 shown as a traditional hierarchical clustering with its traditional dendrogram, and as a non-traditional hierarchical clustering with its non-traditional dendrogram.]

38

Association Rules

Association rules are a data mining technique and complement market basket analysis.

All association rules are unidirectional and take the following form:
Left-hand side IMPLIES Right-hand side

Both the left-hand side and the right-hand side of the rule may contain multiple items or combinations of items, such as the following:
Yellow Peppers IMPLIES Red Peppers, Bananas, and Bakery

Associations are written as A → B, where A is called the antecedent or left-hand side (LHS) and B is called the consequent or right-hand side (RHS).
– Ex: “If people buy a printer, then they buy a cartridge”
  » The antecedent is “buy printer” and the consequent is “buy cartridge”

39

Association Rules

Market Basket Analysis
– Necessary to have a list of transactions and what was purchased in each one.
– Ex:
  Transaction 1: Frozen pizza, cola, milk
  Transaction 2: Milk, potato chips
  Transaction 3: Cola, frozen pizza
  Transaction 4: Milk, pretzels
  Transaction 5: Cola, pretzels

40


              Frozen Pizza  Milk  Cola  Potato Chips  Pretzels
Frozen Pizza  2             1     2     0             0
Milk          1             3     1     1             1
Cola          2             1     3     0             1
Potato Chips  0             1     0     1             0
Pretzels      0             1     1     0             2
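The co-occurrence counts can be derived from the five transactions. The following is a minimal sketch (variable names are illustrative): each cell counts the transactions that contain both the row item and the column item, so the diagonal holds how often each item appears on its own.

```python
from itertools import combinations_with_replacement

items = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

# counts[a][b] = number of transactions containing both a and b.
counts = {a: {b: 0 for b in items} for a in items}
for a, b in combinations_with_replacement(items, 2):
    n = sum(1 for t in transactions if a in t and b in t)
    counts[a][b] = counts[b][a] = n  # the table is symmetric

for a in items:
    print(f"{a:<13}", *(counts[a][b] for b in items))
```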

41


Measures of Association

– Support
  » The percentage of baskets in the analysis where the rule is true, that is, where both the left-hand side and the right-hand side of the association are found.

– Confidence
  » The percentage of baskets having the left-hand side item that also contain the right-hand side item. Confidence differs from support in that it is the probability that the right-hand side item is present given that we know the left-hand side item is in the basket.
  » Calculated as a ratio: (frequency of A and B) / (frequency of A)

42


Measures of Association

– The support measure
  • for the rule
    “Cola IMPLIES Frozen Pizza” is 40% (2 of the 5 baskets)
    “Frozen Pizza IMPLIES Cola” is 40%
  • for the single item
    “Milk” is 60% (3 of the 5 baskets)

(Note: support considers only the combination and not the direction.)

43


Measures of Association

– Confidence
  “Milk IMPLIES Potato Chips” has confidence:
  = (frequency of A and B) / (frequency of A)
  = 20% / 60%
  = 33%
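The support and confidence figures can be reproduced from the five example transactions. This is a minimal sketch; the function names `support` and `confidence` are illustrative and not taken from any library.

```python
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(*items):
    """Fraction of baskets that contain every one of the given items."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)

def confidence(lhs, rhs):
    """(frequency of A and B) / (frequency of A) for the rule lhs IMPLIES rhs."""
    return support(lhs, rhs) / support(lhs)

print(f"{support('Cola', 'Frozen Pizza'):.0%}")     # support of the rule, either direction
print(f"{support('Milk'):.0%}")                     # support of a single item
print(f"{confidence('Milk', 'Potato Chips'):.0%}")  # confidence of Milk IMPLIES Potato Chips
```

Support is symmetric in its items, while confidence depends on which side is the antecedent: with these baskets, "Potato Chips IMPLIES Milk" has a confidence of 100%.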

44

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

45

KDD Process

– Selection (Pre-Mining 1): Obtain data from various sources.
– Preprocessing (Pre-Mining 2): Cleanse data.
– Transformation (Pre-Mining 3): Convert to common format. Transform to new format.
– Data Mining: Obtain desired results.
– Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner.

Modified from [FPSS96C]

46

KDD Process Ex: Web Log

Selection:
– Select log data (dates and locations) to use

Preprocessing:
– Remove identifying URLs
– Remove error logs

Transformation:
– Sessionize (sort and group)

Data Mining:
– Identify and count patterns
– Construct data structure

Interpretation/Evaluation:
– Identify and display frequently accessed sequences.

Potential User Applications:
– Cache prediction
– Personalisation
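The web log steps can be sketched end to end on a toy log. Everything concrete here is an assumption for illustration: the log layout (user, minute, URL, status), the 30-minute inactivity gap that ends a session, and the choice of consecutive page pairs as the pattern to count.

```python
from collections import Counter
from itertools import groupby

# Selection: hypothetical log entries (user, minute, url, status).
log = [
    ("u1", 0, "/home", 200), ("u1", 2, "/products", 200),
    ("u2", 1, "/home", 200), ("u2", 3, "/products", 200),
    ("u1", 5, "/missing", 404), ("u1", 6, "/cart", 200),
    ("u1", 90, "/home", 200), ("u1", 91, "/products", 200),
]
SESSION_GAP = 30  # assumed: minutes of inactivity that end a session

# Preprocessing: remove error entries.
clean = [e for e in log if e[3] == 200]

# Transformation: sessionize (sort by user and time, group, split on long gaps).
sessions = []
for user, entries in groupby(sorted(clean), key=lambda e: e[0]):
    current, last = [], None
    for _, minute, url, _ in entries:
        if last is not None and minute - last > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(url)
        last = minute
    sessions.append(current)

# Data mining: count consecutive page pairs across all sessions.
pairs = Counter(p for s in sessions for p in zip(s, s[1:]))

# Interpretation/evaluation: report the most frequently accessed sequence.
print(pairs.most_common(1))
```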

47

Data Mining Development

• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Neural Networks
• Decision Tree Algorithms

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

48

Data mining: What it can’t do

– Tell the value of the patterns to the organization
– Replace skilled business analysts or managers
– Automatically discover solutions without guidance
