Data Mining and Machine Learning
DWML, 2007 1/46
What is Data Mining?
Definitions
• Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [Frawley, Piatetsky-Shapiro, Matheus 1991].
• Data Mining is a step in the KDD process consisting of applying computational techniques that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [Fayyad, Piatetsky-Shapiro, Smyth 1996].
• Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [Hand, Mannila, Smyth 2001].
• The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience [Mitchell, 1997].
Data Mining vs. Machine Learning
• Different roots: information extraction vs. intelligent machines
• Today very large overlap of techniques and applications
• Some remaining differences: emphasis on large datasets (DM), theoretical analysis of learnability (ML), . . .
• For this course: Data Mining ≈ Machine Learning
DWML, 2007 3/46
What is Data Mining?
Data Mining in practice
[Diagram: real-life data (preprocess) meets an off-the-shelf algorithm (adapt); evaluate + iterate. The data side involves data/domain-specific operations, the algorithm side general algorithmic methods.]
DWML, 2007 4/46
The CRISP model
Background
Developed by a four-member consortium in an EU project. Members of the consortium:
• Teradata (NCR)
• SPSS (statistical software)
• DaimlerChrysler
• OHRA (Insurance and Banking)

The consortium was supported by a special interest group composed of over 300 organizations involved in data mining projects.
Aim
From http://www.crisp-dm.org/:
The CRISP-DM project has developed an industry- and tool-neutral Data Mining process model. [. . . ] this project defined and validated a data mining process that is applicable in diverse industry sectors. This will make large data mining projects faster, cheaper, more reliable and more manageable. Even small scale data mining investigations will benefit from using CRISP-DM.
DWML, 2007 5/46
The CRISP model
Phases of the CRISP DM Process Model
(Illustration from www.crisp-dm.org)
DWML, 2007 6/46
The CRISP model
Business/Data understanding
• Vision: Data Mining extracts whatever interesting hidden information there is in the data
• Reality: Data Mining techniques solve several types of well-defined tasks
• Reality: The data used must support the task at hand
• Reality: The data miner must understand the background of the data, in order to select an appropriate data mining technique
DWML, 2007 7/46
The CRISP model
Our Focus
DWML, 2007 8/46
The CRISP model
Selecting the Modeling Technique
[Diagram: the universe of techniques is narrowed down to the techniques appropriate for the problem, subject to constraints (time, data characteristics, staff training/knowledge), political requirements (management, understandability), and the tool(s) selected (defined by tool).]
DWML, 2007 9/46
Types of Tasks and Models
Prediction (Supervised Learning)
• Task: predict some (unobserved) target variable based on observed values of attribute variables
  - Regression (Larose: Estimation), if target is continuous
  - Classification, if target is discrete
• Models e.g.: Decision trees, Neural networks, Bayesian (classification) networks, . . .
Clustering
• Task: identify coherent subgroups in data
• Models e.g.: k-means, hierarchical clustering, . . .
Association analysis
• Task: identify patterns of co-occurrence of attribute values
• Models: Apriori and extensions
Visualization (Exploratory Data Analysis)
• Task: find intelligible visualization of relevant data properties
• Models: Graphs, plots, . . .
DWML, 2007 10/46
Examples: Regression
Nutritional rating of cereals
Data: nutritional information and ratings for 77 cereals.
Task: find best linear approximation of the dependency of rating on sugars.
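Such a fit is a one-line least-squares problem; the sketch below uses made-up sugar/rating pairs (not the actual cereal data) just to show the shape of the computation.

```python
import numpy as np

# Hypothetical (sugars, rating) pairs standing in for the 77-cereal dataset.
sugars = np.array([1.0, 3.0, 6.0, 9.0, 12.0, 15.0])
rating = np.array([68.0, 59.0, 47.0, 40.0, 34.0, 29.0])

# Least-squares fit of rating ≈ a * sugars + b.
a, b = np.polyfit(sugars, rating, deg=1)
print(f"rating ≈ {a:.2f} * sugars + {b:.2f}")
```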
DWML, 2007 11/46
Examples: Classification
Text Categorization
The Association for Computing Machinery (ACM) maintains a subject classification scheme for computer science research papers. Part of the subject hierarchy (1998 version):

I. Computing Methodologies
  I.2 Artificial Intelligence
    I.2.6 Learning
      - Analogies
      - Concept learning
      - Connectionism and neural nets
      - Induction
      - Knowledge acquisition
      - Language acquisition
      - Parameter learning
Papers are manually classified by authors or editors.
Data: collection of classified papers (full text or abstracts).
Task: build a classifier that automatically assigns a subject index to new, unclassified papers.
DWML, 2007 12/46
Examples: Classification
Spam Filtering
Spam filtering in Mozilla: the user trains the mail reader to recognize spam by manually labeling incoming mails as spam/no spam.

Data: collection of user-classified emails (full text).
Task: build a classifier that automatically categorizes an incoming email as spam/no spam.
DWML, 2007 13/46
Examples: Classification
Character Recognition
Example of a Pattern Recognition problem (pattern recognition is an older discipline than data mining, but now can also be seen as a sub-area of data mining):

Data: collection of handwritten characters, correctly labeled.
Task: build a classifier that identifies new handwritten characters.
DWML, 2007 14/46
Examples: Classification
Credit Rating
From existing customer data, predict whether a person applying for a new loan will repay or default on the loan.

Data: existing customer records with attributes like age, employment type, income, . . . and information on payback history.
Task: build a classifier that predicts whether a new customer will repay the loan.
DWML, 2007 15/46
Examples: Clustering
Text Categorization
Web mining: automatically detect similarity between web pages (e.g. to support search engines or automatic construction of internet directories).

Data: the WWW.
Task: construct a (similarity) model for pages on the WWW.
DWML, 2007 16/46
Examples: Clustering
Bioinformatics: Phylogenetic Trees
From biological data construct a model of evolution.
[Phylogenetic tree over: Homo Sapiens, Pan Troglodytes, Rattus Norvegicus, Bacillus Subtilis, Bacillus Halodurans, Caulobacter Crescentus, Lactococcus Lactis]

Data: e.g. genome sequences of different animal species.
Task: construct a hierarchical model of similarity between the species.
DWML, 2007 17/46
Examples: Association Analysis
Association Rules
Data: transaction data

Transaction  Items bought
1            Beer, Soap, Milk, Butter
2            Beer, Chips, Butter
3            Milk, Spaghetti, Butter, Tomatos
. . .        . . .
Task: infer association rules
{Beer} ⇒ {Chips}
{Spaghetti, Tomatos} ⇒ {Wine}
. . .
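As a minimal illustration of how such rules are scored, the sketch below computes the support and confidence of a candidate rule over the toy transactions above; the rule-generation part of Apriori is not shown.

```python
# Toy transactions from the table above.
transactions = [
    {"Beer", "Soap", "Milk", "Butter"},
    {"Beer", "Chips", "Butter"},
    {"Milk", "Spaghetti", "Butter", "Tomatos"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent ∪ consequent divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Beer"}, transactions))                # ≈ 0.67: Beer occurs in 2 of 3 transactions
print(confidence({"Beer"}, {"Chips"}, transactions))  # 0.5: half of the Beer transactions contain Chips
```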
DWML, 2007 18/46
Tools
WEKA
• Free open source Java toolbox (www.cs.waikato.ac.nz/ml/weka/)
• Many methods, good interface
Clementine
• Commercial system, Windows only
• Many methods, good interface, integrated use of MS SQL server

For all toolboxes: easy use of methods can be dangerous – correct interpretation of results requires understanding of methods. Documentation essential (and often a weak point...)!
DWML, 2007 19/46
Classification and Decision Trees
DWML, 2007 20/46
Classification
A high-level view
[Figure: a spam classifier maps attribute values SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body 'adult' (yes/no), Body 'zambia' (yes/no) to the class Spam (yes/no).]

[Figure: a character classifier maps pixel attributes Cell-1, Cell-2, Cell-3, . . . , Cell-324 (each with values 1..64) to the class Symbol (A..Z, 0..9).]

DWML, 2007 21/46
Classification
Labeled Data

Rows are instances (cases, examples); columns are attributes (features, predictor variables) plus the class variable (target variable).

SubAllCap  TrustSend  InvRet  . . .  B'zambia'  Spam
y          n          n       . . .  n          y
n          n          n       . . .  n          n
n          y          n       . . .  n          y
n          n          n       . . .  n          n
. . .

Cell-1  Cell-2  Cell-3  . . .  Cell-324  Symbol
1       1       4       . . .  12        B
1       1       1       . . .  3         1
34      37      43      . . .  22        Z
1       1       1       . . .  7         0
. . .

(In principle, any attribute can become the designated class variable)
DWML, 2007 22/46
Classification
Attribute Types
Each attribute (including the class variable) has associated with it a set of possible values or states. E.g.

States(A) = {yes, no}
States(A) = {red, blue, green}
States(A) = {010100, 020100, . . . , 311299}
States(A) = R

States(A) finite: A is called discrete
States(A) = R: A is called continuous or numeric
States(A) = N: A can be interpreted as continuous (N ⊂ R), or made discrete by replacing N e.g. with {1, 2, . . . , 100, > 100} (few data mining methods are specifically adapted to integer valued attributes).
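A discretization of this kind is a trivial mapping; a sketch with the bins from the example above:

```python
def discretize(n):
    """Map a natural number to one of the states {1, 2, ..., 100, '>100'}."""
    return n if n <= 100 else ">100"

print([discretize(v) for v in (3, 100, 250)])   # [3, 100, '>100']
```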
DWML, 2007 23/46
Classification
Complete/Incomplete Data
Name            Gender  DoB     Income  Customer since  Last Purchase
Thomas Jensen   m       050367  190000  010397          250504
Jens Nielsen    m       171072  250000  051103          040204
Lene Hansen     f       021159  140000  300300          250105
Ulla Sørensen   f       220879  210000  180998          031099
. . .

Name            Gender  DoB     Income  Customer since  Last Purchase
Thomas Jensen   m       050367  190000  010397          250504
Jens Nielsen    m       ?       ?       051103          040204
Lene Hansen     f       021159  ?       300300          250105
Ulla Sørensen   f       ?       ?       180998          031099
. . .
DWML, 2007 24/46
Classification
Classification
Classification data in general:
Attributes : Variables A1, A2, . . . , An (discrete or continuous).
Class variable : Variable C. Always discrete: States(C) = {c1, . . . , cl} (set of class labels)
A (complete data) Classifier is a mapping
C : States(A1, . . . , An) → States(C).
A classifier able to handle incomplete data provides mappings
C : States(Ai1, . . . , Aik) → States(C)

for subsets {Ai1, . . . , Aik} of {A1, . . . , An}.
A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.
DWML, 2007 25/46
Classification
Iris dataset
[Illustration: iris flower with sepal length (SL), sepal width (SW), petal length (PL), petal width (PW)]

Measurement of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris.

First reported in: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annual Eugenics, 7 (1936).

Attributes                Class variable
SL    SW    PL    PW      Species
5.1   3.5   1.4   0.2     Setosa
4.9   3.0   1.4   0.2     Setosa
6.3   2.9   6.0   2.1     Virginica
6.3   2.5   4.9   1.5     Versicolor
. . .
DWML, 2007 26/46
Classification
Labeled data in instance space:
[Scatter plot of the Iris data in instance space, with classes Setosa, Versicolor and Virginica, and the partition defined by a classifier.]
DWML, 2007 27/46
Classification
Decision Regions
Axis-parallel linear: e.g. Decision Trees
Piecewise linear: e.g. Naive Bayes
Nonlinear: e.g. Neural Network

DWML, 2007 28/46
Classification
Classifiers differ in . . .
• Model space: types of partitions and their representation.
• How they compute the class label corresponding to a point in instance space (the actual classification task).
• How they are learned from data.
Some important types of classifiers:
• Decision trees
• Naive Bayes classifier
• Other probabilistic classifiers (TAN, . . . )
• Neural networks
• K-nearest neighbors
DWML, 2007 29/46
Decision Trees
Example
Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.
[Left: partition of the instance space {m, f} × [0, 2.5]. Right: the corresponding decision tree:
sex = m: height ≥ 1.8 → tall, height < 1.8 → short
sex = f: height ≥ 1.7 → tall, height < 1.7 → short]
DWML, 2007 30/46
Decision Trees
A decision tree is a tree
- whose internal nodes are labeled with attributes
- whose leaves are labeled with class labels
- edges going out from node labeled with attribute A are labeled with subsets of States(A), such that all labels combined form a partition of States(A).
Possible partitions e.g.:
States(A) = R : [−∞, 2.3[, [2.3, ∞]  or  [−∞, 1.9[, [1.9, 3.5[, [3.5, ∞]
States(A) = {a, b, c} : {a}, {b}, {c}  or  {a, b}, {c}
DWML, 2007 31/46
Decision Trees
Decision tree classification
Each point in the instance space is sorted into a leaf by the decision tree. It is classified according to the class label at that leaf.
[Figure: the instance [m, 1.85] is sorted down the tree (sex = m, height ≥ 1.8) and ends up in the leaf labeled tall.]

C([m, 1.85]) = tall
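One way to make this concrete is to represent the example tree as nested dictionaries and sort an instance down it; this representation is only an illustration, not something prescribed by the slides.

```python
def classify(node, instance):
    """Sort an instance down the tree; leaves are plain class labels."""
    while isinstance(node, dict):
        value = instance[node["attr"]]
        if "threshold" in node:                       # numeric split
            branch = ">=" if value >= node["threshold"] else "<"
        else:                                         # discrete split
            branch = value
        node = node["children"][branch]
    return node

# The height/sex example tree.
tree = {"attr": "sex", "children": {
    "m": {"attr": "height", "threshold": 1.8, "children": {">=": "tall", "<": "short"}},
    "f": {"attr": "height", "threshold": 1.7, "children": {">=": "tall", "<": "short"}},
}}

print(classify(tree, {"sex": "m", "height": 1.85}))   # -> tall
```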
DWML, 2007 32/46
Decision Trees
How to learn a decision tree?
Given a dataset:
Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    High     Medium  25               Bad
4    Medium   Medium  50               Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Low     25               Bad
8    Medium   Medium  75               Good

We want to build a decision tree that
• is small
• has high classification accuracy
DWML, 2007 33/46
Decision Trees
Some simple candidate trees:
Savings (L / M / H): leaves {2,5,7}, {1,4,8}, {3,6} with class counts G:1 B:2, G:3 B:0, G:1 B:1
Assets (L / M / H): leaves {2,7}, {3,4,5,8}, {1,6} with class counts G:0 B:2, G:3 B:1, G:2 B:0
Income (≤ 50 / > 50): leaves {2,3,4,6,7}, {1,5,8} with class counts G:2 B:3, G:3 B:0
Income (≤ 25 / > 25): leaves {3,6,7}, {1,2,4,5,8} with class counts G:1 B:2, G:4 B:1
DWML, 2007 34/46
Decision Trees
How accurate are these trees? Accurate trees: pure class label distributions at the leaves:
Pure leaf distributions: (3, 0), (0, 2), (2, 0). Impure leaf distributions: (1, 1), (2, 2), (2, 3), (3, 1), (1, 2).
Entropy
A measure of impurity: for S = (x1, x2, . . . , xn) with x = x1 + · · · + xn:

Entropy(S) = − Σ_{i=1}^{n} (xi / x) · log2(xi / x)

Entropy(2, 0) = Entropy(0, 2) = Entropy(3, 0) = −(1 · log2(1) + 0 · log2(0)) = 0 + 0 = 0
Entropy(3, 1) = −(0.75 · log2(0.75) + 0.25 · log2(0.25)) = 0.311 + 0.5 = 0.811
Entropy(2, 3) = −(0.4 · log2(0.4) + 0.6 · log2(0.6)) = 0.528 + 0.442 = 0.97
Entropy(2, 2) = Entropy(1, 1) = −(0.5 · log2(0.5) + 0.5 · log2(0.5)) = 0.5 + 0.5 = 1.0
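The values above are easy to reproduce; a minimal sketch of the entropy of a count vector:

```python
from math import log2

def entropy(*counts):
    """Entropy of a class-label distribution given as absolute counts
    (using the convention 0 · log2(0) = 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy(3, 0))   # 0.0
print(entropy(3, 1))   # 0.811...
print(entropy(2, 3))   # 0.971...
print(entropy(2, 2))   # 1.0
```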
DWML, 2007 35/46
Decision Trees
Information Gain
Example: 12 cases with class distribution (9, 3), and two candidate split attributes:

A (true / false): leaf distributions (8, 2) and (1, 1), with entropies 0.722 and 1.0
B (L / M / H): leaf distributions (5, 1), (2, 2) and (2, 0), with entropies 0.65, 1.0 and 0.0

Expected Entropy:  A: (10/12) · 0.722 + (2/12) · 1.0 = 0.768
                   B: (2/12) · 0.0 + (6/12) · 0.65 + (4/12) · 1.0 = 0.658

Data Entropy: Entropy(9, 3) = 0.811

Information Gain:  A: 0.811 − 0.768 = 0.043
                   B: 0.811 − 0.658 = 0.153
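The expected entropies and gains follow directly from the leaf distributions; a self-contained sketch:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, leaves):
    """Entropy of the parent distribution minus the size-weighted
    average entropy of the leaf distributions."""
    total = sum(parent)
    expected = sum(sum(leaf) / total * entropy(leaf) for leaf in leaves)
    return entropy(parent) - expected

# Attribute A: leaves (8,2), (1,1); attribute B: leaves (5,1), (2,2), (2,0).
print(information_gain((9, 3), [(8, 2), (1, 1)]))          # ≈ 0.043
print(information_gain((9, 3), [(5, 1), (2, 2), (2, 0)]))  # ≈ 0.153
```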
DWML, 2007 36/46
Decision Trees
Expected entropies:
Savings (L / M / H): leaf distributions (1,2), (3,0), (1,1); expected entropy (3/8) · 0.918 + (3/8) · 0.0 + (2/8) · 1.0 = 0.594
Assets (L / M / H): leaf distributions (0,2), (3,1), (2,0); expected entropy (2/8) · 0.0 + (4/8) · 0.811 + (2/8) · 0.0 = 0.405
Income (≤ 50 / > 50): leaf distributions (2,3), (3,0); expected entropy (5/8) · 0.97 + (3/8) · 0.0 = 0.606
Income (≤ 25 / > 25): leaf distributions (1,2), (4,1); expected entropy (3/8) · 0.918 + (5/8) · 0.722 = 0.795

Information gains are Entropy(5, 3) = 0.954 minus expected entropies.
DWML, 2007 37/46
Decision Trees
After the second (and final) ID3 iteration:
Cases sorted into leaves: Assets = L → {2, 7} (G:0, B:2); Assets = H → {1, 6} (G:2, B:0); Assets = M → split on Savings: Savings = L → {5} (G:1, B:0), Savings = M → {4, 8} (G:2, B:0), Savings = H → {3} (G:0, B:1).

Resulting tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.

DWML, 2007 38/46
Decision Trees
ID3 algorithm for decision tree learning
• Determine attribute A with highest information gain (for continuous attributes: also determine split-value)
• Construct decision tree with root A, and one leaf for each value of A (two leaves if A is continuous)
• For a non-pure leaf L: determine attribute B with highest information gain for the data sorted into L.
• Replace L with a subtree consisting of root B and one leaf for each value of B (two leaves if B is continuous)
• Continue until all leaves are pure, or some other termination condition applies (e.g.: possible information gains below a given threshold)
• Label each leaf with the class label that is most frequent among the data sorted into the leaf
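A compact sketch of this procedure, for discrete attributes only (no continuous splits and no extra stopping criterion); the attribute and class names follow the credit-risk example, with Income left out to keep the split handling purely discrete.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    expected = 0.0
    for value in {r[attr] for r in rows}:
        sub = [r[target] for r in rows if r[attr] == value]
        expected += len(sub) / len(rows) * entropy(sub)
    return entropy([r[target] for r in rows]) - expected

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:          # pure leaf, or no attribute left
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    children = {}
    for value in {r[best] for r in rows}:
        sub = [r for r in rows if r[best] == value]
        children[value] = id3(sub, [a for a in attrs if a != best], target)
    return {"attr": best, "children": children, "label": majority}

data = [  # the 8 training cases from the slide (Savings, Assets, Risk)
    {"Savings": "M", "Assets": "H", "Risk": "Good"},
    {"Savings": "L", "Assets": "L", "Risk": "Bad"},
    {"Savings": "H", "Assets": "M", "Risk": "Bad"},
    {"Savings": "M", "Assets": "M", "Risk": "Good"},
    {"Savings": "L", "Assets": "M", "Risk": "Good"},
    {"Savings": "H", "Assets": "H", "Risk": "Good"},
    {"Savings": "L", "Assets": "L", "Risk": "Bad"},
    {"Savings": "M", "Assets": "M", "Risk": "Good"},
]
print(id3(data, ["Savings", "Assets"], "Risk"))  # splits on Assets first, then Savings
```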
DWML, 2007 39/46
Decision Trees
Pros and Cons
+ Easy to interpret.
+ Efficient learning methods.
- Difficulties with handling missing data.
DWML, 2007 40/46
Overfitting
The problem
The learned tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.
Predictions made by the learned model:
Assets=M, Savings=M ⇒ Risk=good
Assets=M, Savings=H ⇒ Risk=bad
• The training data contained a single case with Assets=M, Savings=H
• This case had the (uncharacteristic?) class label Risk=bad
• The model is overfitted to the training data
• With the prediction Assets=M, Savings=H ⇒ Risk=good we will likely obtain a higher accuracy on future cases

DWML, 2007 41/46
Overfitting
The general problem
• Complex models will represent properties of the training data very precisely
• The training data may contain some peculiar properties that are not representative for the domain
• The model will not perform optimally in classifying future instances
[Plot: classification error as a function of model size, for the training data and for future data.]
DWML, 2007 42/46
Overfitting
Decision Tree Pruning
To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after the tree construction:

• Data is split into training set and test set
• Decision tree is learned using training data only
• Pruning: for internal node A: replace subtree rooted at A with leaf if this reduces the classification error on the test set.
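A sketch of this kind of reduced-error pruning. It assumes a nested-dictionary tree representation (made up for this illustration) in which every node carries the majority class label of its training cases in a 'label' field, and internal nodes additionally have 'attr' and 'children'; the test cases mirror the ones shown in the example that follows.

```python
def classify(node, instance):
    """Sort an instance into a leaf and return that leaf's label."""
    while "attr" in node:
        node = node["children"][instance[node["attr"]]]
    return node["label"]

def accuracy(tree, rows, target):
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

def prune(node, root, test_rows, target):
    """Bottom-up: collapse a subtree into its majority-label leaf whenever
    this strictly reduces the error on the test set (mutates the tree in place)."""
    if "attr" not in node:
        return
    for child in node["children"].values():
        prune(child, root, test_rows, target)
    before = accuracy(root, test_rows, target)
    attr, children = node.pop("attr"), node.pop("children")  # tentatively make it a leaf
    if accuracy(root, test_rows, target) <= before:           # no strict improvement: undo
        node["attr"], node["children"] = attr, children

# Example: the tree learned from the credit-risk training data.
tree = {"attr": "Assets", "label": "Good", "children": {
    "L": {"label": "Bad"}, "H": {"label": "Good"},
    "M": {"attr": "Savings", "label": "Good", "children": {
        "L": {"label": "Good"}, "M": {"label": "Good"}, "H": {"label": "Bad"}}},
}}
test = [{"Assets": "M", "Savings": "H", "Risk": "Good"},
        {"Assets": "M", "Savings": "L", "Risk": "Bad"},
        {"Assets": "M", "Savings": "H", "Risk": "Good"},
        {"Assets": "M", "Savings": "M", "Risk": "Good"}]
prune(tree, tree, test, "Risk")
print(tree)   # the Savings subtree collapses into a 'Good' leaf
```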
DWML, 2007 43/46
Overfitting
Example
Full tree: Assets = L → bad, Assets = H → good; Assets = M → Savings: Savings = L → good, Savings = M → good, Savings = H → bad.

Pruned tree: Assets = L → bad, Assets = M → good, Assets = H → good.

Test data (showing only cases with Assets=M):

Id.  Savings  Assets  Income  Risk
9    High     Medium  50      Good
10   Low      Medium  50      Bad
11   High     Medium  75      Good
12   Medium   Medium  50      Good

Accuracy of full tree on test data: 50%
Accuracy of pruned tree on test data: 75%
⇒ prune the Savings node.
DWML, 2007 44/46
Overfitting
Model Tuning with Test Set
[Diagram: the data is split into a training set and a test set; a model is learned on the training data and applied to the test data; the parameter setting is tuned and learning repeated; the final model is then learned from all the data.]
• Models can be adjusted or tuned (e.g. pruning subtrees, setting model parameters)
• Tuning can be an iterative process that requires repeated evaluations on the test set
• A final model is learned using all the data
• Problem: part of data “wasted” as test set
DWML, 2007 45/46
Overfitting
Cross Validation
• Partition the data into n subsets or folds (typically: n = 10).
• For each setting of the tuning parameter:
  - for i = 1 to n: learn a model using folds 1, . . . , i − 1, i + 1, . . . , n as training data; measure performance on fold i
  - model performance = average performance on the n test sets
• Choose parameter setting with best performance
• Learn final model with chosen parameter setting using the whole available data
Cross Validation is also used for final evaluation of a learned model.
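The procedure maps onto a couple of nested loops; learn and evaluate below are stand-ins for whatever learner and performance measure are being tuned (the demo uses a trivial majority-class "learner", purely for illustration).

```python
import random
from collections import Counter

def cross_validate(data, learn, evaluate, parameter_values, n_folds=10):
    """Pick the parameter setting with the best average performance over
    n folds, then learn the final model on all of the data."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]

    def cv_score(param):
        scores = []
        for i in range(n_folds):
            train = [row for j, fold in enumerate(folds) if j != i for row in fold]
            scores.append(evaluate(learn(train, param), folds[i]))
        return sum(scores) / n_folds

    best = max(parameter_values, key=cv_score)
    return best, learn(data, best)

# Demo with a stand-in majority-class learner that ignores its parameter.
data = [{"x": i, "class": "a" if i % 3 else "b"} for i in range(30)]
learn = lambda rows, param: Counter(r["class"] for r in rows).most_common(1)[0][0]
evaluate = lambda model, rows: sum(r["class"] == model for r in rows) / len(rows)
print(cross_validate(data, learn, evaluate, parameter_values=[1, 2, 3]))
```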
DWML, 2007 46/46