Fall 2004 Data Mining 1 IE 483/583 Knowledge Discovery and Data Mining Dr. Siggi Olafsson Fall 2003


Fall 2004 Data Mining 1

IE 483/583 Knowledge Discovery and Data Mining

Dr. Siggi Olafsson

Fall 2003


What is Data Mining?

(… and should I be here?)


Dilbert Replies ...


Some Definitions

“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”

“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”


What can Data Mining Do?

• Supervised
– Classification
– Prediction

• Unsupervised
– Association discovery
– Clustering


Applications of Data Mining

• Manufacturing Process Improvement

• Sales and Marketing

• Mapping the Human Genome

• Diagnosing Breast Cancer

• Financial Crime Identification

• Portfolio Management


Technical Background

• Machine Learning
– Data mining: business-oriented use of AI

• Statistics
– Regression, sampling, DOE, etc.

• Decision Support
– Data warehousing, data marts, OLAP, etc.

• Interdisciplinary tools put together to form the process of knowledge discovery in databases …


Historical Perspective

Period  Field  Developments
< 40    Stat   Bayes theorem, regression, etc.
40s     AI     Neural networks
50s     AI     Nearest neighbor, single link, perceptron
        Stat   Resampling, bias reduction, jackknife
60s     Stat   Linear models for classification, exploratory data analysis (EDA)
        IR     Similarity measures, clustering
        DB     Relational data model
70s     IR     Smart IR systems
        AI     Genetic algorithms
        Stat   EM algorithm, k-means clustering
80s     AI     Kohonen maps, decision trees
90s     DB     Association rule algorithms, web & search engines, data warehousing, OLAP


What Changed?

• Very large databases

• Increased computational power as enabler

• Business perspective


Knowledge Discovery in Databases

[Figure: pipeline from Databases to Data warehouse to Prepared Data to Model/Structures to Knowledge. Two parts of the pipeline are labeled "Data Warehouse Systems Engineering" and "Knowledge Discovery and Data Mining".]


Course Information

• We assume data is ready for mining

• Thus, we focus on:
– models and structures, and
– algorithms

• More information on course homepage

http://www.public.iastate.edu/~olafsson/mining.html


Course Outline

• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
– Mining the Web
– Customer Relationship Management (CRM)
• Course Review


Questions?


Data Mining

• Discover patterns in data
– automatic or semi-automatic process
– meaningful or useful pattern
– large amounts of data

• What does such a pattern look like?
– Black box
– Transparent box


Describing Structural Patterns

• Some ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters


The Weather Problem

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No


A Decision List

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes

• These are classification rules
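
A decision list is applied top to bottom, and the first rule that matches decides the class. The sketch below encodes the five rules as ordered `if` statements and checks them against the 14 weather instances. The function and variable names are illustrative, not part of the course material.

```python
# Weather instances: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, temperature, humidity, windy):
    """Apply the decision list in order; the first matching rule wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


# Instances whose predicted class differs from the recorded one.
errors = [row for row in WEATHER if classify(*row[:4]) != row[4]]
print(len(errors), "misclassified out of", len(WEATHER))
```

Because the rules are ordered, reading one of them in isolation can mislead: "if humidity = normal then play = yes" only holds for instances that the earlier rules did not already catch.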


Association Rules

• Many association rules can be inferred:

if temperature = cool then humidity = normal

if humidity = normal and windy = false then play = yes

if outlook = sunny and play = no then humidity = high


Three Layers of the Process

Inputs

Outputs

Algorithms


Inputs

• Three forms
– Concepts
   • concept description - what you want to learn
– Instances
   • examples - what you learn from
– Attributes
   • features of instances - variables you have values for


Concepts: Styles of Learning

• Classification (supervised) learning

• Association learning

• Clustering

• Numeric prediction


Instances: Learn from Examples

• Set of instances to be classified, or associated, or clustered

• Example of concept to be learned

• Data set: flat file (single relation)
– denormalization

• Family tree example
– concept: sister
– example: family tree


Family Tree

[Figure: family tree]

Peter (M) = Peggy (F): children Steven (M), Graham (M), Pam (F)

Grace (F) = Ray (M): children Ian (M), Pippa (F), Brian (M)

Pam (F) = Ian (M): children Anna (F), Nikki (F)


Denormalizing Relational Data

Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2  Sister of?
Steven  Male    Peter    Peggy    Pam    Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki  Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna   Female  Pam      Ian      Yes
All others                                                         No
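
The flat file can be produced by pairing every person with every other person and labeling each pair. The sketch below is one way to do that, assuming the family tree from the earlier slide (so Graham appears, although the table does not list him) and using `None` for the unknown parents of the oldest generation. All names in the code are illustrative.

```python
from itertools import permutations

# One relation: (name, gender, parent1, parent2); None marks an unknown parent.
PEOPLE = [
    ("Peter", "Male", None, None),
    ("Peggy", "Female", None, None),
    ("Grace", "Female", None, None),
    ("Ray", "Male", None, None),
    ("Steven", "Male", "Peter", "Peggy"),
    ("Graham", "Male", "Peter", "Peggy"),
    ("Pam", "Female", "Peter", "Peggy"),
    ("Ian", "Male", "Grace", "Ray"),
    ("Pippa", "Female", "Grace", "Ray"),
    ("Brian", "Male", "Grace", "Ray"),
    ("Anna", "Female", "Pam", "Ian"),
    ("Nikki", "Female", "Pam", "Ian"),
]


def is_sister(first, second):
    """True if `second` is a sister of `first` (female, same known parents)."""
    _, _, p1, p2 = first
    _, gender, q1, q2 = second
    return gender == "Female" and p1 is not None and (p1, p2) == (q1, q2)


def denormalize(people):
    """Join the relation with itself: one flat row per ordered pair of people."""
    rows = []
    for first, second in permutations(people, 2):
        label = "Yes" if is_sister(first, second) else "No"
        rows.append(first + second + (label,))
    return rows


flat = denormalize(PEOPLE)
sisters = [(row[0], row[4]) for row in flat if row[8] == "Yes"]
print(len(flat), "rows,", len(sisters), "labeled Yes")
```

Twelve people already give 132 rows, almost all of them labeled "No", which hints at the computational and storage cost of denormalizing larger relations.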


Denormalization Problems

• Computational and storage costs

• Trivial regularities

– customers → products
– product → supplier
– supplier → supplier address

• Infinite relations


Content of Instances: Attributes

• Instance characterized by values of its (predefined) set of attributes
– Numeric (“continuous”)
– Nominal (categorical)
– Ordinal (rank)
– Interval
– Ratio

Focus in this class


Data Preparation

• Data …
– assembly
   • set of instances/denormalizing relational data
– integration
   • enterprise-wide database/data warehouse
– cleaning
   • missing data
– aggregation
   • good information


ARFF Format

• Used by the Java package Weka

• Independent, unordered instances

• No relationship between instances


Weather Data

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no


Features

• % = comments

• @relation <name>

• @attribute <name> <type>
– Attribute types: Nominal and numeric

• @data
– List of instances
– Missing values represented by ?
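
To make the format concrete, here is a minimal reader for the subset of ARFF listed above: `%` comments, `@relation`, nominal and numeric `@attribute` lines, and comma-separated `@data` rows with `?` for missing values. It is a sketch for illustration only, not Weka's parser; it assumes unquoted names and values, and `parse_arff` and `SAMPLE` are made-up names.

```python
def parse_arff(text):
    """Parse a small ARFF subset; return (relation, attributes, data)."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)          # missing value
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)         # nominal value kept as text
            data.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [s.strip() for s in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


SAMPLE = """% a two-instance excerpt with one missing value
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute play? { yes, no }
@data
sunny, 85, no
overcast, ?, yes
"""

relation, attributes, data = parse_arff(SAMPLE)
print(relation, attributes, data)
```

Each data row is read independently of the others, which matches the format's assumption of independent, unordered instances.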


Other Issues

• Missing data

• Inaccurate values

• Look at the data!!!


Recall the Three Layers of the Data Mining Process

Inputs (done)

Outputs (structural patterns) (next)

Algorithms


Describing Structural Patterns

• Ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters


The Weather Problem

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No


A Decision List

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes


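A Decision Tree

[Figure: decision tree. Root: Outlook, with branches Sunny, Overcast, and Rainy. Sunny leads to a Humidity test (High: Play=No); Overcast leads to Play=Yes; Rainy leads to a Windy test (TRUE: Play=No).]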


Concepts: Styles of Learning

• Classification (supervised) learning

• Association learning

• Clustering

• Numeric prediction


Classification Rules

• Classification easily read off decision trees

• How?

• Other direction possible, but not as straightforward

If a and b then x
If c and d then x
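
One answer to "How?" is to emit one rule per leaf, with the tests along the path from the root as its conditions. The sketch below does this for a weather tree. The nested-tuple encoding and the `humidity = normal` and `windy = false` branches are assumptions made for illustration (they are consistent with the weather data).

```python
# A tree node is either a class label (str) or (attribute, {value: subtree}).
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per leaf of the tree."""
    if isinstance(node, str):          # leaf: the path so far is the rule body
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render a rule in the slide's "If ... then ..." style."""
    body = " and ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"If {body} then play = {label}"


RULES = tree_to_rules(TREE)
for conditions, label in RULES:
    print(format_rule(conditions, label))
```

Rules read off this way are mutually exclusive, so their order does not matter. Going from rules back to a tree is harder, because a tree must pick a single attribute to test at the root.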


Corresponding Decision Tree

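[Figure: decision tree with root a and internal nodes b, c, and d; the subtree testing c and then d appears twice. Branches are labeled y and n, and three leaves are labeled x.]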


Replicated Subtree Problem

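[Figure: decision tree testing X=1 at the root and Y=1 on each branch; branches are labeled y and n, and the leaves are a and b.]

If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b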


Replicated Subtree Problem

If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b

x,y,z,w take values 1,2,3


Rules with exceptions

If x and y then a EXCEPT if z then b

• Account for new instances

• Exceptions from exceptions, etc


Association Rules

• Coverage (support): number of instances it predicts correctly
• Accuracy (confidence): coverage divided by number of instances it applies to

If temperature = cool then humidity = normal

• Coverage = 4
• Accuracy = 100%
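
The two measures can be computed directly from their definitions. The sketch below does so for the nominal weather data and reproduces coverage 4 and accuracy 100% for the rule on this slide. The dictionary-based rule encoding and the helper names are illustrative choices.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
WEATHER = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(instance, conditions):
    """True if the instance has every attribute value in `conditions`."""
    return all(instance[attr] == value for attr, value in conditions.items())


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage: instances predicted correctly. Accuracy: coverage / applies."""
    applies = [x for x in data if matches(x, antecedent)]
    correct = [x for x in applies if matches(x, consequent)]
    coverage = len(correct)
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy


cool_rule = coverage_and_accuracy(
    WEATHER, {"temperature": "cool"}, {"humidity": "normal"}
)
print(cool_rule)  # (4, 1.0)
```

The same function handles rules with several conditions on either side, since both the antecedent and the consequent are just sets of attribute-value pairs.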


Interpretation

If windy = false and play = no then outlook = sunny and humidity = high

If windy = false and play = no then outlook = sunny

If windy = false and play = no then humidity = high

If humidity = high and windy = false and play = no then outlook = sunny


The Shapes Problem

Shaded = standing
Unshaded = lying


Instances

Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying


Classification Rules

If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing

• Work well to classify these instances

• Problems?


Relational Rules

• Rules comparing attributes to constants are called propositional rules

• Structural patterns?

If width > height then lying
If height > width then standing
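
The difference between the two kinds of rules shows up on shapes outside the training set. The sketch below applies both rule sets to the eight instances and then to one new shape. The ≥ comparisons in the propositional rules are an assumption (the operators did not survive on the slide), and the new shape is a made-up example.

```python
# (width, height, sides, class) for the eight instances.
SHAPES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
    (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def propositional(width, height):
    """Rules comparing attributes to constants, applied in order."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules comparing attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height is not covered by the rules


for name, rule in (("propositional", propositional), ("relational", relational)):
    hits = sum(rule(w, h) == label for w, h, _, label in SHAPES)
    print(name, "correct on", hits, "of", len(SHAPES))

# A hypothetical new shape that is wider than it is tall.
print(propositional(20, 10), relational(20, 10))
```

Both rule sets fit the training instances, but the constants 3.5 and 7.0 only describe this particular sample, while the relational rules express the underlying concept.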


CPU Performance Example

     Cycle      Main memory (KB)   Cache   Channels       Performance
     time (ns)  min      max               min    max
     MYCT       MMIN     MMAX      CACH    CHMIN  CHMAX   PRP
1    125        256      6000      256     16     128     198
2    29         8000     32000     32      8      32      269
3    29         8000     32000     32      8      32      220
4    29         8000     32000     32      8      32      172
5    29         8000     16000     32      8      16      132
…
207  125        2000     8000      0       2      14      52
208  480        512      8000      32      0      0       67
209  480        1000     4000      0       0      0       45


Numerical Prediction: regression equation

PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
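
A regression equation is a weighted sum of the attribute values plus a constant. The sketch below evaluates one for the first machine in the CPU table. It assumes the coefficients on this slide read as -56.1 (intercept), 0.049, 0.015, 0.006, 0.630, -0.270, and 1.46; the signs in particular are an assumption, since they did not survive in the slide text.

```python
# Assumed coefficients (signs are an assumption, see the note above).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Linear prediction: intercept plus the weighted sum of attribute values."""
    return INTERCEPT + sum(w * machine[name] for name, w in COEFFICIENTS.items())


# First row of the CPU performance table (its recorded PRP is 198).
first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
         "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(first), 1))
```

A single global equation is compact, but as this row suggests it can be far off for individual machines, which motivates the tree-based alternatives that follow.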


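Regression Tree

[Figure: regression tree. Root: CHMIN, split at 7.5 (≤ 7.5 / > 7.5), leading to CACH and MMAX nodes. One node has a three-way split at 8.5, (8.5, 28], and > 28; further MMAX nodes and a leaf with value 64.6 are visible.]

- Accuracy?
- Large and possibly awkward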


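Model Trees

[Figure: model tree. Root: CHMIN, split at 7.5 (≤ 7.5 / > 7.5), leading to CACH and MMAX nodes. Further splits at 8.5 and at 28000; leaves are linear models, including LM4, LM5, and LM6.]

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = …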


Instance-Based Representation

• Store actual instances

• New instance: algorithm finds “most similar” stored instance

• Features
– What is a similar instance?
– Need to store (all?) instances
– Really a black box method
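
A minimal sketch of the idea, using the shape instances from earlier: store every instance, and classify a new one by copying the class of the closest stored instance. Euclidean distance over (width, height, sides) is an assumption here; choosing the similarity measure is exactly the open question in the first bullet.

```python
import math

# Stored instances: ((width, height, sides), class).
STORED = [
    ((2, 4, 4), "standing"), ((3, 6, 4), "standing"), ((4, 3, 4), "lying"),
    ((7, 8, 3), "standing"), ((7, 6, 3), "lying"), ((2, 9, 4), "standing"),
    ((9, 1, 4), "lying"), ((10, 2, 3), "lying"),
]


def nearest_neighbor(query, stored=STORED):
    """Return the class of the stored instance closest to `query`."""
    _, label = min(stored, key=lambda item: math.dist(item[0], query))
    return label


print(nearest_neighbor((3, 7, 4)))
print(nearest_neighbor((9, 2, 4)))
```

Nothing is learned ahead of time: all the work happens at query time, and the "model" is the stored data itself, which is why the method yields no explicit structural pattern.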


Clusters:

[Figure: two diagrams showing the same instances a through k grouped into clusters.]


Next: Algorithms