
Page 1:

DATABASE SYSTEMS GROUP

Lecture notes
Knowledge Discovery in Databases, Summer Semester 2012

Lecture: Dr. Eirini Ntoutsi
Tutorials: Erich Schubert

http://www.dbs.ifi.lmu.de/cms/Knowledge_Discovery_in_Databases_I_(KDD_I)

Ludwig-Maximilians-Universität München, Institut für Informatik, Lehr- und Forschungseinheit für Datenbanksysteme

Lecture 4: Classification

Knowledge Discovery in Databases I: Classification 1

Page 2:

Sources

• Previous KDD I lectures at LMU (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Jörg Sander, Matthias Schubert, Arthur Zimek)

• Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011.

• Margaret Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

• Tom Mitchell, Machine Learning, McGraw Hill, 1997.

• Wikipedia

Page 3:

Outline

• Introduction

• The classification process

• Classification (supervised) vs clustering (unsupervised)

• Decision trees

• Evaluation of classifiers

• Things you should know

• Homework/tutorial

Page 4:

Classification problem

Given:
• a dataset D = {t1, t2, …, tn} and
• a set of classes C = {c1, …, ck},
the classification problem is to define a mapping f: D → C where each ti is assigned to one class cj.

Classification
– predicts categorical (discrete, unordered) class labels
– constructs a model (classifier) based on a training set
– uses this model to predict the class label for new instances of unknown class

Prediction
– is similar, but may be viewed as having an infinite number of classes
– more on prediction in the next lectures

Page 5:

A simple classifier

A simple classifier (translated from the German original: Alter = age, Autotyp = car type, LKW = truck):

if Age > 50 then Risk = low;
if Age ≤ 50 and CarType = truck then Risk = low;
if Age ≤ 50 and CarType ≠ truck then Risk = high.

ID  Age  CarType  Risk
1   23   family   high
2   17   sports   high
3   43   sports   high
4   68   family   low
5   32   truck    low
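As a sketch (in Python, with the attribute and value names translated from the German slide), the three rules can be applied directly to the table above:

```python
# Hypothetical encoding of the slide's age / car-type example.
def risk(age, car_type):
    """Apply the slide's three rules in order."""
    if age > 50:
        return "low"
    if car_type == "truck":          # Autotyp = LKW in the original
        return "low"
    return "high"                    # age <= 50 and car_type != truck

data = [  # (ID, age, car_type, actual risk) from the slide's table
    (1, 23, "family", "high"),
    (2, 17, "sports", "high"),
    (3, 43, "sports", "high"),
    (4, 68, "family", "low"),
    (5, 32, "truck",  "low"),
]

for _id, age, car, actual in data:
    assert risk(age, car) == actual  # the rules fit the table exactly
```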

Page 6:

Applications

• Credit approval
– Classify bank loan applications as e.g. safe or risky

• Fraud detection
– e.g., in credit cards

• Churn prediction
– e.g., in telecommunication companies

• Target marketing
– Is the customer a potential buyer for a new computer?

• Medical diagnosis

• Character recognition

• …

Page 7:

Outline

• Introduction

• The classification process

• Classification (supervised) vs clustering (unsupervised)

• Decision trees

• Evaluation of classifiers

• Things you should know

• Homework/tutorial

Page 8:

Classification techniques

• Typical approach:
– Create a specific model by evaluating training data (or using domain experts' knowledge)
– Assess the quality of the model
– Apply the developed model to new data

• Classes must be predefined!
• Many techniques:
– Decision trees
– Naïve Bayes
– kNN
– Neural Networks
– Support Vector Machines
– …

Page 9:

Classification technique (detailed)

• Model construction: describing a set of predetermined classes
– The set of tuples used for model construction is the training set
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The model is represented as classification rules, decision trees, or mathematical formulae

• Model evaluation: estimate the accuracy of the model
– The set of tuples used for model evaluation is the test set
– The class label of each tuple/sample in the test set is known in advance
– The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test set samples that are correctly classified by the model
– The test set is independent of the training set, otherwise over-fitting will occur

• Model usage: classifying future or unknown objects
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Class attribute: tenured = {yes, no} (predefined class values)

Training set (known class label attribute):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Test set (known class label attribute, with the class value predicted by the model):

NAME   RANK            YEARS  TENURED  PREDICTED
Maria  Assistant Prof  3      no       no
John   Associate Prof  7      yes      no
Franz  Professor       3      yes      yes

Unseen data (unknown class label attribute, with the class value predicted by the model):

NAME     RANK            YEARS  TENURED  PREDICTED
Jeff     Professor       4      ?        yes
Patrick  Associate Prof  8      ?        yes
Maria    Associate Prof  2      ?        no

Page 10:

Model construction

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training data (attributes: NAME, RANK, YEARS; class attribute: TENURED):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
IF (rank != 'professor') AND (years < 6) THEN tenured = 'no'

Page 11:

Model evaluation

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
IF (rank != 'professor') AND (years < 6) THEN tenured = 'no'

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Classifier quality: is it acceptable?
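A sketch (in Python, assuming ties between the two rules are resolved in the order given) of this evaluation step on the testing data above:

```python
# Evaluate the learned rules on the slide's testing data.
def tenured(rank, years):
    if rank == "professor" or years > 6:
        return "yes"
    return "no"  # covers rank != 'professor' and years <= 6

test_set = [  # (name, rank, years, true label)
    ("Tom",     "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),
    ("George",  "professor",      5, "yes"),
    ("Joseph",  "assistant prof", 7, "yes"),
]

correct = sum(tenured(r, y) == label for _, r, y, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 0.75: Merlisa (7 years, not tenured) is misclassified
```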

Page 12:

Model usage for prediction

Classifier (model), as learned from the training data:
IF (rank = 'professor') OR (years > 6) THEN tenured = 'yes'
IF (rank != 'professor') AND (years < 6) THEN tenured = 'no'

Unseen data:

NAME     RANK            YEARS  TENURED
Jeff     Professor       4      ?
Patrick  Assistant Prof  8      ?
Maria    Assistant Prof  2      ?

Tenured? The model predicts the class label for each unseen tuple.

Page 13:

Outline

• Introduction

• The classification process

• Classification (supervised) vs clustering (unsupervised)

• Decision trees

• Evaluation of classifiers

• Things you should know

• Homework/tutorial

Page 14:

A supervised learning task

• Classification is a supervised learning task
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Clustering is an unsupervised learning task
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the goal is to group the data into groups of similar data (clusters)

Page 15:

Supervised learning example

[Figure: screws, nails and paper clips plotted by width [cm] and height [cm]; a classification model assigns a new object of unknown class to one of the classes.]

Question: What is the class of a new object? Is it a screw, a nail or a paper clip?

Page 16:

Unsupervised learning example

[Figure: the same width [cm] / height [cm] data grouped by clustering into Cluster 1: paper clips and Cluster 2: nails.]

Question: Is there any structure in the data (based on their characteristics, i.e., width, height)?

Page 17:

Classification techniques

• Statistical methods– Bayesian classifiers etc

• Partitioning methods– Decision trees etc

• Similarity based methods– K-Nearest Neighbors etc

Page 18:

Outline

• Introduction

• The classification process

• Classification (supervised) vs clustering (unsupervised)

• Decision trees

• Evaluation of classifiers

• Things you should know

• Homework/tutorial

Page 19:

Decision trees (DTs)

• One of the most popular classification methods

• DTs are included in many commercial systems nowadays

• Easy to interpret, human readable, intuitive

• Simple and fast methods

• Partition-based method: partitions the space into rectangular regions

• Many algorithms have been proposed
– ID3 (Quinlan 1986), C4.5 (Quinlan 1993), CART (Breiman et al. 1984), …

Page 20:

Decision tree for the “play tennis” problem

[Figure: the play-tennis training set and the decision tree built from it.]

Page 21:

Representation

• Representation
– Each internal node specifies a test of some attribute of the instance
– Each branch descending from a node corresponds to one of the possible values of this attribute (e.g., each branch corresponds to a possible value of outlook)
– Each leaf node assigns a class label

• Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance

[Figure: the play-tennis training set and tree, annotated with attribute tests, attribute values, and class values.]

Page 22:

Representation cont’

• Decision trees represent a disjunction of conjunctions of constraints on the attribute values of the instances

• Each path from the root to a leaf node corresponds to a conjunction of attribute tests

• The whole tree corresponds to a disjunction of these conjunctions

• We can "translate" each path into IF-THEN rules (human readable):

IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN (PlayTennis = Yes)
IF (Outlook = Rain) ∧ (Wind = Weak) THEN (PlayTennis = Yes)

Page 23:

The basic decision tree learning algorithm

Basic algorithm (ID3, Quinlan 1986)
– The tree is constructed in a top-down recursive divide-and-conquer manner
– At the start, all the training examples are at the root node
– The question is "which attribute should be tested at the root?"
• Attributes are evaluated using some statistical measure, which determines how well each attribute alone classifies the training examples
• The best attribute is selected and used as the test attribute at the root
– For each possible value of the test attribute, a descendant of the root node is created and the instances are mapped to the appropriate descendant node
– The procedure is repeated for each descendant node, so instances are partitioned recursively

When do we stop partitioning?
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

Page 24:

Algorithm cont’

• Pseudocode (shown as a figure on the slide)

• But … which attribute is the best?

• The goal is to select the attribute that is most useful for classifying examples.

• By useful we mean that the resulting partitioning is as pure as possible.

• A partition is pure if all its instances belong to the same class.
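A minimal Python sketch of the top-down procedure just described (an illustration, not the lecture's exact pseudocode; the attribute-selection measure is passed in as a `score` function, e.g. information gain):

```python
from collections import Counter

def id3(examples, attributes, score):
    """Grow a tree top-down. `examples` are tuples whose last element is
    the class label; `attributes` are index positions into the tuples."""
    labels = [label for *_, label in examples]
    if len(set(labels)) == 1:                 # stop: node is pure
        return labels[0]
    if not attributes:                        # stop: no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority voting
    best = max(attributes, key=lambda a: score(examples, a))
    tree = {"attribute": best, "branches": {}}
    for value in {ex[best] for ex in examples}:       # one branch per value
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree["branches"][value] = id3(subset, rest, score)
    return tree
```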

Page 25:

Attribute selection measure: Information gain

• Used in ID3

• It uses entropy, a measure of pureness of the data

• The information gain Gain(S,A) of an attribute A relative to a collection of examples S measures the expected reduction in entropy of S due to splitting on A:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

(Entropy(S): before splitting; the weighted sum: after splitting on A)

• The attribute with the highest entropy reduction is chosen

Page 26:

Entropy

• Let S be a collection of positive and negative examples for a binary classification problem, C={+, -}.

• p+: the percentage of positive examples in S

• p-: the percentage of negative examples in S

• Entropy measures the impurity of S:

Entropy(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)

• Examples:
– S: [9+, 5−]: Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
– S: [7+, 7−]: Entropy(S) = −(7/14) log₂(7/14) − (7/14) log₂(7/14) = 1
– S: [14+, 0−]: Entropy(S) = −(14/14) log₂(14/14) − (0/14) log₂(0/14) = 0

• Entropy = 0 when all members belong to the same class

• Entropy = 1 when there are equal numbers of positive and negative examples

• In the general case (k-class problem):

Entropy(S) = − Σ_{i=1}^{k} p_i log₂(p_i)
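The three example values can be reproduced with a small sketch (the helper name `entropy` is ours; 0 · log 0 is treated as 0):

```python
import math

def entropy(counts):
    """Entropy of a collection given its per-class counts."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]   # skip empty classes
    return -sum(p * math.log2(p) for p in ps)

print(f"{entropy([9, 5]):.3f}")   # 0.940
assert entropy([7, 7]) == 1.0     # equal mix of the two classes
assert entropy([14, 0]) == 0.0    # pure collection
```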

Page 27:

Information Gain example 1

Which attribute should be chosen next?

Page 28:

Information Gain example 2

[Figure: training set for the example.]

Page 29:

Attribute selection measure: Gain ratio

• Information gain is biased towards attributes with a large number of values

– Consider the attribute ID (unique identifier)

• C4.5 (a successor of ID3) uses the gain ratio, which normalizes the gain, to overcome this problem

• Example:
• Humidity = {High, Low}
• Wind = {Weak, Strong}
• Outlook = {Sunny, Overcast, Rain}

• The attribute with the maximum gain ratio is selected as the splitting attribute

SplitInformation(S, A) = − Σ_{v ∈ Values(A)} (|S_v| / |S|) · log₂(|S_v| / |S|)
(measures the information generated by splitting S into |Values(A)| partitions)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
(Gain(S, A) measures the information w.r.t. the classification)

SplitInformation(S, Humidity) = −(7/14) log₂(7/14) − (7/14) log₂(7/14) = 1
SplitInformation(S, Wind) = −(8/14) log₂(8/14) − (6/14) log₂(6/14) = 0.9852
SplitInformation(S, Outlook) = −(5/14) log₂(5/14) − (4/14) log₂(4/14) − (5/14) log₂(5/14) = 1.5774

• High split info: partitions have more or less the same size (uniform)
• Low split info: few partitions hold most of the tuples (peaks)
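The three SplitInformation values can be checked with a short sketch (function name is ours; the partition sizes come from the 14-example play-tennis data):

```python
import math

def split_information(sizes):
    """SplitInformation(S, A) from the partition sizes |S_v|."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes)

print(split_information([7, 7]))                 # Humidity: 1.0
print(f"{split_information([8, 6]):.4f}")        # Wind: 0.9852
print(f"{split_information([5, 4, 5]):.4f}")     # Outlook: 1.5774
```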

Page 30:

Attribute selection measure: Gini Index (CART)

• Let S be a dataset containing examples from k classes, and let p_j be the probability of class j in S. The Gini index of S is given by:

Gini(S) = 1 − Σ_{j=1}^{k} p_j²

• The Gini index considers a binary split for each attribute
• If S is split based on attribute A into two subsets S₁ and S₂:

Gini(S, A) = (|S₁| / |S|) · Gini(S₁) + (|S₂| / |S|) · Gini(S₂)

• Reduction in impurity:

ΔGini(S, A) = Gini(S) − Gini(S, A)

• The attribute A that provides the smallest Gini(S, A) (or, equivalently, the largest reduction in impurity) is chosen to split the node

• How to find the binary splits?
– For discrete-valued attributes, consider all possible subsets that can be formed by the values of A
– For numerical attributes, find the split points (slides 41-42)

Page 31:

Gini index example

Let S have 9 tuples with buys_computer = "yes" and 5 with "no":

Gini(S) = 1 − (9/14)² − (5/14)² = 0.459

Suppose the attribute income partitions S into S₁: {low, medium} with 10 tuples and S₂: {high} with 4 tuples:

Gini_{income ∈ {low, medium}}(S) = (10/14) · Gini(S₁) + (4/14) · Gini(S₂)

The Gini index is computed analogously for the remaining partitionings of the income attribute.

So, the best binary split for income is on {medium, high} and {low}.
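A sketch reproducing Gini(S) = 0.459 (helper names are ours):

```python
def gini(counts):
    """Gini index from per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """Weighted Gini of a binary split, Gini(S, A); each part
    is given as its own per-class counts."""
    n1, n2 = sum(part1), sum(part2)
    n = n1 + n2
    return n1 / n * gini(part1) + n2 / n * gini(part2)

# Slide's example: 9 "yes" / 5 "no" tuples
print(f"{gini([9, 5]):.3f}")   # 0.459
```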

Page 32:

Comparing Attribute Selection Measures

• The three measures are commonly used and, in general, return good results, but:

– Information gain Gain(S,A):

• biased towards multivalued attributes

– Gain ratio GainRatio(S,A) :

• tends to prefer unbalanced splits in which one partition is much smaller than the others

– Gini index:

• biased to multivalued attributes

• has difficulty when # of classes is large

• tends to favor tests that result in equal-sized partitions and purity in both partitions

• Several other measures exist

Page 33:

Hypothesis search space (by ID3)

• The hypothesis space is complete
– The solution is surely in there

• Greedy approach

• No backtracking
– Can get stuck in local minima

• Outputs a single hypothesis

Page 34:

Space partitioning

• Decision boundary: the border line between two neighboring regions of different classes

• Decision regions: axis-parallel hyper-rectangles

Page 35:

Comparing DTs/ partitionings


Page 36:

Overfitting

Consider adding a noisy training example D15 to the play-tennis training set:

D15: Sunny, Hot, Normal, Strong, No

How would the earlier tree (built on D1-D14) be affected?

Page 37:

Overfitting

• An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples

• Overfitting: consider a hypothesis h
– error_train(h): the error of h on the training set
– error_D(h): the error of h on the entire distribution D of the data
– Hypothesis h overfits the training data if there is an alternative hypothesis h′ in H such that:

error_train(h) < error_train(h′) and error_D(h′) < error_D(h)

Page 38:

Overfitting


Page 39:

Avoiding overfitting

• Two approaches to avoid overfitting

– Prepruning: halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold

– Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees
• Use a set of data different from the training data (a test set) to decide which is the "best pruned tree"

Page 40:

Effect of pruning

Page 41:

Dealing with continuous-valued attributes

• Let A be a continuous-valued attribute

• We must determine the best split point t for A, (A ≤ t):
– Sort the values of A in increasing order
– Identify adjacent examples that differ in their target classification
• Typically, every such pair suggests a potential split threshold t = (a_i + a_{i+1}) / 2
– Select the threshold t that yields the best value of the splitting criterion

Example: with the temperature values of the play-tennis data there are 2 potential thresholds, t = (48 + 60) / 2 = 54 and t = (80 + 90) / 2 = 85, i.e., Temperature > 54 and Temperature > 85. Compute the attribute selection measure (e.g., information gain) for both and choose the best (Temperature > 54 here).
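A sketch of the threshold-finding step (the exact temperature/label pairs besides the two class changes at 48→60 and 80→90 are assumptions for illustration):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose classes differ."""
    pairs = sorted(zip(values, labels))
    out = []
    for (a, ca), (b, cb) in zip(pairs, pairs[1:]):
        if ca != cb:                 # class change between neighbors
            out.append((a + b) / 2)  # midpoint as potential threshold
    return out

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))   # [54.0, 85.0]
```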

Page 42:

Continuous-valued attributes cont’

• Let t be the threshold chosen from the previous step

• Create a boolean attribute based on A and threshold t with two possible outcomes: yes, no

– S1 is the set of tuples in S satisfying (A >t), and S2 is the set of tuples in S satisfying (A ≤ t)

[Figure: the boolean test can be drawn either as "Temperature > 54" with branches yes / no, or as "Temperature" with branches "> 54" and "≤ 54". A second figure shows a tree for the play-tennis problem when the attributes Humidity and Wind are continuous.]

Page 43:

When to consider decision trees

• Instances are represented by attribute-value pairs
– Instances are represented by a fixed number of attributes, e.g. outlook, humidity, wind, and their values, e.g. (wind = strong, outlook = rainy, humidity = normal)
– The easiest situation for a DT is when attributes take a small number of disjoint possible values, e.g. wind = {strong, weak}
– There are extensions for numerical attributes as well, e.g. temperature, income

• The class attribute has discrete output values
– Usually binary classification, e.g. {yes, no}, but also more class values, e.g. {pos, neg, neutral}

• The training data might contain errors
– DTs are robust to errors: both errors in the class values of the training examples and in the attribute values of these examples

• The training data might contain missing values
– DTs can be used even when some training examples have unknown attribute values

Page 44:

Outline

• Introduction

• The classification process

• Classification (supervised) vs clustering (unsupervised)

• Decision trees

• Evaluation of classifiers

• Things you should know

• Homework/tutorial

Page 45:

Classifier evaluation

• The quality of a classifier is evaluated over a test set, different from the training set

• For each instance in the test set, we know its true class label

• Compare the predicted class (by some classifier) with the true class of the test instances

• Terminology
– Positive tuples: tuples of the main class of interest
– Negative tuples: all other tuples

• A useful tool for analyzing how well a classifier performs is the confusion matrix

• For an m-class problem, the matrix is of size m × m

• An example of a matrix for a 2-class problem (rows: actual class, columns: predicted class):

              Predicted C1         Predicted C2         Totals
Actual C1     TP (true positive)   FN (false negative)  P
Actual C2     FP (false positive)  TN (true negative)   N
Totals        P′                   N′

Page 46:

Classifier evaluation measures

• Accuracy / recognition rate: the percentage of test set instances correctly classified

accuracy(M) = (TP + TN) / (P + N)

• Error rate / misclassification rate: error_rate(M) = 1 − accuracy(M)

error_rate(M) = (FP + FN) / (P + N)

• More effective when the class distribution is relatively balanced

classes             buy_computer = yes  buy_computer = no  total   recognition (%)
buy_computer = yes  6954                46                 7000
buy_computer = no   412                 2588               3000
total               7366                2634               10000   95.42
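Checking the table's overall recognition rate (a sketch using the confusion-matrix counts above):

```python
# Confusion-matrix counts from the buy_computer table
TP, FN = 6954, 46      # actual yes
FP, TN = 412, 2588     # actual no
P, N = TP + FN, FP + TN

accuracy = (TP + TN) / (P + N)
error_rate = (FP + FN) / (P + N)
print(accuracy)    # 0.9542, i.e. the table's 95.42 %
print(error_rate)  # 0.0458
```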

Page 47:

Classifier evaluation measures cont’

If classes are imbalanced:

• Sensitivity/ True positive rate/ recall: % of positive tuples that are correctly recognized

• Specificity/ True negative rate : % of negative tuples that are correctly recognized

Knowledge Discovery in Databases I: Classification 47

PTPMysensitivit =)(

NTNMyspecificit =)(

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42
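Per-class recognition rates in the table are exactly sensitivity and specificity (a short check on the buy_computer numbers):

```python
TP, P = 6954, 7000   # true positives / all actual positives (buy_computer = yes)
TN, N = 2588, 3000   # true negatives / all actual negatives (buy_computer = no)

sensitivity = TP / P   # recall of the positive class
specificity = TN / N   # recall of the negative class

print(round(sensitivity * 100, 2))  # 99.34, as in the table
print(round(specificity * 100, 2))  # 86.27
```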



Classifier evaluation measures cont’

• Precision: % of tuples labeled as positive which are actually positive

• Recall: % of positive tuples labeled as positive

– Precision says nothing about the positive tuples the classifier misses (false negatives)

– Recall says nothing about the negative tuples wrongly labeled as positive (false positives)

• F-measure/ F1 score/F-score combines both

• Fβ-measure is a weighted measure of precision and recall


precision(M) = TP / (TP + FP)

recall(M) = TP / (TP + FN) = TP / P


F(M) = 2 · precision(M) · recall(M) / (precision(M) + recall(M))

It is the harmonic mean of precision and recall

Fβ(M) = (1 + β²) · precision(M) · recall(M) / (β² · precision(M) + recall(M))

Common values for β: β = 2, β = 0.5
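Applied to the buy_computer confusion matrix, the measures can be computed directly (a sketch; `f_beta` is an illustrative helper name):

```python
TP, FP, FN = 6954, 412, 46   # from the buy_computer confusion matrix

precision = TP / (TP + FP)   # 6954 / 7366
recall = TP / (TP + FN)      # 6954 / 7000
f1 = 2 * precision * recall / (precision + recall)

def f_beta(p, r, beta):
    """Weighted harmonic mean: beta > 1 weights recall higher, beta < 1 precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(precision, 4), round(recall, 4), round(f1, 4))
print(round(f_beta(precision, recall, 2), 4))    # recall-oriented F2
print(round(f_beta(precision, recall, 0.5), 4))  # precision-oriented F0.5
```

Note that β = 1 reduces Fβ to the ordinary F1 score.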


Classifier evaluation methods

• Holdout method
  – Given data is randomly partitioned into two independent sets

• Training set (e.g., 2/3) for model construction

• Test set (e.g., 1/3) for accuracy estimation

– It takes little time to compute (+)

– The estimate depends on how the data are divided (-)

– Random sampling: a variation of holdout
  • Repeat holdout k times; accuracy is the average accuracy obtained
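The holdout partition can be sketched in a few lines (the helper name, the fixed seed, and the 2/3 split are illustrative choices):

```python
import random

def holdout_split(data, train_fraction=2/3, seed=0):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]              # copy so the original order is kept
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)

train, test = holdout_split(list(range(30)))
print(len(train), len(test))  # 20 10
```

Random sampling simply repeats this split with k different seeds and averages the resulting accuracies.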



Classifier evaluation methods cont’

• Cross-validation (k-fold cross-validation, usually k = 10)
  – Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size

  – Training and testing is performed k times
    • At the i-th iteration, use Di as the test set and the remaining subsets as the training set

  – Accuracy is the average accuracy over all k iterations

  – Does not depend as strongly on how the data are divided (+)

  – The algorithm has to be re-run from scratch k times (-)

  – Leave-one-out: k folds where k = # of tuples, so only one sample is used as a test set at a time; suitable for small data sets

  – Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data

• Stratified 10-fold cross-validation is recommended
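The k-fold procedure can be sketched as follows (an unstratified sketch; the round-robin fold assignment and the dummy evaluation callback are illustrative, not from the lecture):

```python
def kfold_indices(n, k=10):
    """Partition indices 0..n-1 into k mutually exclusive folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)   # round-robin assignment
    return folds

def cross_validate(data, k, train_and_test):
    """Run k iterations; fold i is the test set, the remaining folds the training set."""
    folds = kfold_indices(len(data), k)
    accuracies = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / k   # average accuracy over all iterations

# Toy run with a dummy evaluation that just reports the relative test-set size
avg = cross_validate(list(range(25)), 5, lambda tr, te: len(te) / 5)
print(avg)  # 1.0  (each of the 5 folds has exactly 5 elements)
```

A stratified variant would perform the round-robin assignment separately per class so each fold mirrors the overall class distribution.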



Classifier evaluation methods cont’

• Bootstrap: samples the given training data uniformly with replacement
  – i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set

  – Works well with small data sets

• There are several bootstrap methods; a common one is the .632 bootstrap:
  – Suppose we are given a data set of d tuples.

  – The data set is sampled d times, with replacement, resulting in a training set of d samples (also known as the bootstrap sample):
    • It is very likely that some of the original tuples will occur more than once in this set

  – The data tuples that did not make it into the training set form the test set.

  – On average, 36.8% of the tuples will not be selected for training and thereby end up in the test set; the remaining 63.2% form the training set.
    • Each tuple has a probability of 1/d of being selected and (1 − 1/d) of not being chosen in a single draw. We repeat the draw d times, so the probability that a tuple is never chosen during the whole procedure is (1 − 1/d)^d.

    • For large d:  (1 − 1/d)^d ≈ e^(−1) ≈ 0.368

  – Repeat the sampling procedure k times and report the overall accuracy of the model:

    acc(M) = Σ_{i=1}^{k} ( 0.632 · acc(M_i)_test_set + 0.368 · acc(M_i)_train_set )

    where acc(M_i)_test_set is the accuracy of the model obtained from bootstrap sample i when it is applied on test set i, and acc(M_i)_train_set is its accuracy when it is applied over all cases.
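The 36.8% figure can be checked empirically: sampling d tuples with replacement leaves roughly e^(−1) of them out. A small simulation (a sketch; the seed, d = 1000, the number of repetitions, and the example accuracies are arbitrary choices):

```python
import math
import random

def bootstrap_sample(data, rng):
    """Draw len(data) tuples uniformly with replacement; never-drawn tuples form the test set."""
    d = len(data)
    train = [data[rng.randrange(d)] for _ in range(d)]
    chosen = set(train)
    test = [x for x in data if x not in chosen]
    return train, test

rng = random.Random(42)          # fixed seed for reproducibility
d = 1000
fractions = []
for _ in range(200):             # repeat the sampling to estimate the test-set size
    _, test = bootstrap_sample(list(range(d)), rng)
    fractions.append(len(test) / d)
avg = sum(fractions) / len(fractions)

print(round(math.exp(-1), 3))    # 0.368: the large-d limit of (1 - 1/d)^d
print(round(avg, 3))             # empirically close to 0.368

# One round of the .632 weighting, with illustrative accuracies 0.80 and 0.95:
print(round(0.632 * 0.80 + 0.368 * 0.95, 4))  # 0.8552
```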


Classifier evaluation summary

• Accuracy measures

– accuracy, error rate, sensitivity, specificity, precision, F-score, Fβ

• Other parameters

– Speed (construction time, usage time)

– Robustness to noise, outliers and missing values

– Scalability for large data sets

– Interpretability by humans



Things you should know

• What is classification

• Class attribute, attributes

• Train set, test set, new unknown instances

• Supervised vs unsupervised

• Decision tree induction algorithm

• Choosing the best attribute for splitting

• Overfitting

• Dealing with continuous attributes

• Evaluation of classifiers



Homework/ Tutorial

Tutorial: No tutorial this Thursday (Christi Himmelfahrt)
  – Repeat the exercises from the previous tutorials

  – Get familiar with Weka/ Elki/ R/ SciPy.

Homework:
  – Run decision tree classification in Weka

  – Implement a decision tree classifier

Suggested reading:
  – Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011 (Chapter 8)

  – Tom Mitchell, Machine Learning, McGraw Hill, 1997 (Chapter 3)
