Classification
Stages in classification:
- Model construction: given a collection of records (the training set), where each record has a set of attributes including the class, find a model (classifier) that predicts the class as a function of the other attributes.
- Model evaluation: use previously unseen records (the test set) to test the model; the model should assign classes as accurately as possible.
- Model application: apply the model directly to new records.
Stages in Classification

[Diagram: a learning algorithm learns a model from the Training Set (induction); the learned model is then applied to the Test Set to predict the class (deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example Classification/Regression Tasks

Classification:
- Predict the trend (up or down) of stock markets
- Predict tumors as benign or malignant
- Classify credit card transactions as legitimate or fraudulent
- Categorize news articles as finance, weather, entertainment, sports, etc.

Regression:
- Predict the temperature 3 hours from now
- Predict tomorrow's gold/oil price
- Estimate the path of a typhoon
Methods for Classification

Numerous methods for classification:
- Decision trees
- Minimum-distance classifiers
- Artificial neural networks
- Naïve Bayes classifiers
- Quadratic classifiers
- Gaussian-mixture-model classifiers
- Support vector machines
- Rule-based methods
- …
Decision Tree Induction

Again, many algorithms:
- Hunt's algorithm (one of the earliest)
- CART (classification and regression trees)
- ID3, C4.5
- SLIQ, SPRINT
- …
General Steps in Tree Induction

Idea: send all the training data down the tree until it reaches the leaves, where the data should be as "pure" as possible.

Let D be the data set that reaches a node. General procedure:
- If D contains only records belonging to the same class y, mark the node as a leaf with class y.
- Otherwise, use a test on an attribute to split the data set, and create subtrees recursively.
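The recursive procedure above can be sketched in a few lines. This is a minimal illustration, not one of the named algorithms: the record format (dicts with a "class" key) and the greedy attribute chooser are hypothetical.

```python
def induce_tree(records, choose_attribute):
    """Recursively grow a decision tree from a list of records.

    records: list of dicts, each with attribute keys plus a "class" key.
    choose_attribute: greedy function picking the attribute to split on.
    """
    classes = {r["class"] for r in records}
    if len(classes) == 1:
        return classes.pop()          # pure node: mark as a leaf
    attr = choose_attribute(records)  # pick a test (e.g. lowest impurity)
    branches = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        branches[value] = induce_tree(subset, choose_attribute)
    return (attr, branches)           # internal node: test + subtrees
```

For example, on three toy records with `choose_attribute = lambda recs: "refund"`, the root tests "refund" and both branches terminate in leaves.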
Tree Induction

Issues in tree induction:
- How to split the dataset at a node: split based on a greedy search that optimizes a certain criterion/test.
- When to stop splitting: when the "impurity measure" falls below a threshold.
How to Specify the Test?

Depends on the attribute type:
- Nominal (aka "factor"): car type (family, sports, luxury, etc.)
- Ordinal: T-shirt size (small, medium, big, etc.)
- Continuous: temperature (10.3, 25.6, 38, etc.)

Depends on the number of ways to split:
- Binary (2-way) split
- Multi-way split
Splitting Based on Nominal/Ordinal Attributes

- Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets via optimal partitioning, e.g. CarType → {Family, Luxury} vs. {Sports}, or CarType → {Sports, Luxury} vs. {Family}.
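For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary partitions, which can be enumerated directly. A small sketch; `binary_partitions` is a hypothetical helper name, not from the slides:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each way to divide a set of nominal values into two
    non-empty subsets (each partition generated exactly once)."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Fix `first` in the left subset so each partition appears once.
    for k in range(len(rest)):
        for combo in combinations(rest, k):
            left = {first, *combo}
            right = set(values) - left
            yield left, right
```

For CarType = {Family, Sports, Luxury} this yields three candidates: {Family} vs {Sports, Luxury}, {Family, Luxury} vs {Sports}, and {Family, Sports} vs {Luxury}; the first two are the splits shown above.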
Splitting Based on Continuous Attributes

- Multi-way split: discretize to form an ordinal categorical attribute, e.g. Taxable Income → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
- Binary split (A < v or A >= v): consider all possible splits and find the best one, e.g. "Taxable Income > 80K?" with branches Yes/No.
To Determine the Best Split

Goal: nodes with a homogeneous (pure) class distribution are preferred, so we need a measure of node impurity, which should be kept as low as possible during split selection.

Example: a node with C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is nearly homogeneous (low degree of impurity).
Measures of Node Impurity

Numerous measures of node impurity exist. For a node t, where p(j|t) is the relative frequency of class j at t:

- Gini index:           Gini(t) = 1 - Σ_j [p(j|t)]²
- Entropy:              Entropy(t) = -Σ_j p(j|t) log₂ p(j|t)
- Classification error: Error(t) = 1 - max_j p(j|t)

[Figure: the three measures compared as a function of class probability for a 2-class problem.]
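As a sketch, the three measures can be written directly from the formulas above; here `p` is the list of class probabilities p(j|t) at a node t (function names are illustrative):

```python
from math import log2

def gini(p):
    """Gini(t) = 1 - sum_j p(j|t)^2"""
    return 1 - sum(pj ** 2 for pj in p)

def entropy(p):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 log 0 taken as 0."""
    return -sum(pj * log2(pj) for pj in p if pj > 0)

def classification_error(p):
    """Error(t) = 1 - max_j p(j|t)"""
    return 1 - max(p)
```

For a 2-class node, all three measures peak at p = [0.5, 0.5] (Gini 0.5, entropy 1.0, error 0.5) and are zero for a pure node [1.0, 0.0].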
Impurity Measure: Gini Index

Gini index for a given node t, where p(j|t) is the relative frequency of class j at node t:

Gini(t) = 1 - Σ_j [p(j|t)]²

Extreme values:
- Minimum = 0, when all records belong to the same class
- Maximum = 1 - 1/(# of classes), when records are equally distributed among all classes

Examples (two classes, six records):
- C1: 0, C2: 6 → Gini = 1 - (0/6)² - (6/6)² = 0.000
- C1: 1, C2: 5 → Gini = 1 - (1/6)² - (5/6)² = 0.278
- C1: 2, C2: 4 → Gini = 1 - (2/6)² - (4/6)² = 0.444
- C1: 3, C2: 3 → Gini = 1 - (3/6)² - (3/6)² = 0.500
(The Gini index is the "confusion" measure in HW4.)
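The example nodes can be checked with a small helper that computes the Gini index from raw class counts (a sketch, not slide code):

```python
def gini_from_counts(counts):
    """Gini index of a node given the raw count of each class."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)
```

Calling it on [0, 6], [1, 5], [2, 4], and [3, 3] reproduces 0.000, 0.278, 0.444, and 0.500.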
Splitting Based on Gini Index

The quality of splitting a node t into k children:

Gini_split = Σ_{i=1..k} (n_i / n) · Gini(t_i)

where t_i is the node of child i, n_i is the number of records at t_i, and n is the number of records at node t.

(Gini_split is the "total confusion" measure in HW4.)
Gini Index for General Binary Split

Example of computing the Gini index for a binary split on attribute B: the test "B?" sends records to node N1 (Yes) or node N2 (No).

Parent: C1 = 6, C2 = 6, Gini = 0.500

      N1  N2
C1     5   1
C2     2   4

Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
Gini_split(B) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
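The split quality can be recomputed directly from the count matrix; a minimal sketch (function names are illustrative):

```python
def gini(counts):
    """Gini index of one node from its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` is a list of per-child
    class-count lists, each weighted by its share of records n_i / n."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)
```

For the split on B, gini([5, 2]) ≈ 0.408, gini([1, 4]) = 0.320, and gini_split([[5, 2], [1, 4]]) ≈ 0.371.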
Gini Index for Nominal Attributes

For each child, obtain the counts for each class; compute the Gini index of each child; then compute the Gini index of the split.

Multi-way split:

      CarType
      Family  Sports  Luxury
C1      1       2       1
C2      4       1       1
Gini = 0.393

Two-way split (find the best partition of values):

      CarType
      {Sports, Luxury}  {Family}
C1           3              1
C2           2              4
Gini = 0.400

      CarType
      {Sports}  {Family, Luxury}
C1       2             2
C2       1             5
Gini = 0.419
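The three CarType splits can be verified with the same weighted-Gini computation (a sketch; helper names are illustrative, and each child is given as its [C1, C2] counts):

```python
def gini(counts):
    """Gini index of one node from its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split over a list of per-child count lists."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)
```

The multi-way split [[1, 4], [2, 1], [1, 1]] gives 0.393, the partition {Sports, Luxury}/{Family} ([[3, 2], [1, 4]]) gives 0.400, and {Sports}/{Family, Luxury} ([[2, 1], [2, 5]]) gives 0.419, matching the tables above.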
Gini Index for Binary Split on Continuous Attributes

For each attribute: sort the attribute values, then linearly scan them, updating the count matrix and computing the Gini index at each candidate split position. Choose the split with the smallest Gini index.

Sorted values (Taxable Income) and class (Cheat):

Income: 60  70  75  85  90   95   100  120  125  220
Cheat:  No  No  No  Yes  Yes  Yes  No   No   No   No

Split positions with (Yes, No) counts on each side and weighted Gini:

Split v   <= (Yes, No)   > (Yes, No)   Gini
55        (0, 0)         (3, 7)        0.420
65        (0, 1)         (3, 6)        0.400
72        (0, 2)         (3, 5)        0.375
80        (0, 3)         (3, 4)        0.343
87        (1, 3)         (2, 4)        0.417
92        (2, 3)         (1, 4)        0.400
97        (3, 3)         (0, 4)        0.300
110       (3, 4)         (0, 3)        0.343
122       (3, 5)         (0, 2)        0.375
172       (3, 6)         (0, 1)        0.400
230       (3, 7)         (0, 0)        0.420

The best split is at 97 (Gini = 0.300).
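The scan over split positions can be sketched as follows. This is a minimal illustration: thresholds are taken as exact midpoints between consecutive distinct values, so the position the slide rounds to 97 appears here as 97.5, and classes are recounted at each position rather than incrementally updated as the slide's count matrix suggests.

```python
def gini(counts):
    """Gini index from class counts (0 for an empty node)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_binary_split(values, labels):
    """Scan the sorted values and return (threshold, weighted Gini)
    of the best binary split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    classes = sorted(set(labels))
    best_v, best_g = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold separates equal values
        v = (lo + hi) / 2
        left = [sum(1 for _, y in pairs[:i] if y == c) for c in classes]
        right = [sum(1 for _, y in pairs[i:] if y == c) for c in classes]
        g = i / n * gini(left) + (n - i) / n * gini(right)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g
```

On the ten training records (incomes 60K–220K with the Cheat labels above) this returns the split at 97.5 with weighted Gini 0.300. The recount makes the scan O(n²); maintaining the count matrix incrementally, as described above, makes it linear after sorting.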