Data Mining Revision
Types of Attributes
Nominal – Examples: ID numbers, eye color, zip codes
Ordinal – Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval – Example: calendar dates
Ratio – Examples: temperature in Kelvin, length, time, counts
Q2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people’s judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.)
Discrete, qualitative, nominal (ISBN numbers do have order
information, though)
(i) Ability to pass light in terms of the following values: opaque,
translucent, transparent.
Discrete, qualitative, ordinal
(j) Military rank.
Discrete, qualitative, ordinal
(k) Distance from the center of campus.
Continuous, quantitative, interval/ratio
(l) Density of a substance in grams per cubic centimeter.
Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often
give your coat to someone who, in turn, gives you a number that
you can use to claim your coat when you leave.)
Discrete, qualitative, nominal
18. This exercise compares and contrasts some similarity and distance measures.
(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
x = 0101010001
y = 0100011000
Hamming distance = number of different bits = 3
Jaccard similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2 / 5 = 0.4
(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.
The Hamming distance is similar to the SMC. In fact, SMC = 1 − Hamming distance / number of bits. The Jaccard measure is similar to the cosine measure because both ignore 0-0 matches.
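The Hamming and Jaccard computations above can be sketched in Python; `hamming` and `jaccard` are illustrative helper names, not from any particular library:

```python
def hamming(x, y):
    # L1 distance for binary data: number of positions where the vectors differ
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # 1-1 matches divided by the positions that are not 0-0 matches
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return m11 / (len(x) - m00)

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print(hamming(x, y))  # 3
print(jaccard(x, y))  # 0.4
```

Note that `jaccard` depends only on the positions where at least one vector is 1, which is why it behaves like the cosine measure on sparse binary data.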
Cosine Similarity
Example:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||x|| = sqrt(3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = sqrt(42) ≈ 6.481
||y|| = sqrt(1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2) = sqrt(6) ≈ 2.449
cos(x, y) = 5 / (6.481 × 2.449) ≈ 0.3150
• If cosine similarity = 1, the angle between x and y is 0°
• If cosine similarity = 0, the angle between x and y is 90°
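The worked cosine example can be checked with a short Python sketch (`cosine` is an illustrative helper name):

```python
import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(x, y), 4))  # 0.315
```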
Correlation
• Measures the linear relationship between attributes of objects
• To compute correlation, we standardize the data objects x and y, and then take their dot product

corr(x, y) = covariance(x, y) / (s_x s_y) = s_xy / (s_x s_y)

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)

standard deviation s_x = sqrt((1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)²)
standard deviation s_y = sqrt((1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)²)
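A minimal sketch of the standardize-then-dot-product computation, using sample statistics (divide by n−1) as in the definitions above; `correlation` is an illustrative helper name:

```python
import math

def correlation(x, y):
    # corr(x, y) = s_xy / (s_x * s_y), with sample covariance and std. deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

# vectors from exercise 19(a) below
print(correlation((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)))  # ≈ 0.25
```

The (n−1) factors cancel in the ratio, so dividing by n instead gives the same correlation value.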
19. For the following vectors, x and y, calculate the
indicated similarity or distance measures.
(a) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine,
correlation, Jaccard
cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation,
Euclidean, Jaccard
cos(x, y) = 0, corr(x, y) = −1, Euclidean(x, y) = 2,
Jaccard(x, y) = 0
(c) x = (0,−1, 0, 1), y = (1, 0,−1, 0) cosine, correlation,
Euclidean
cos(x, y) = 0, corr(x, y)=0, Euclidean(x, y) = 2
(d) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation,
Euclidean
cos(x, y) = 1, corr(x, y) = 0/0 (undefined),
Euclidean(x, y) = 2
(e) x = (2,−1, 0, 2, 0,−3), y = (−1, 1,−1, 0, 0,−1) cosine,
correlation
cos(x, y) = 0, corr(x, y) = 0
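The remaining measures in exercise 19 can be spot-checked the same way; a sketch for part (b), where x and y disagree in every position (`euclidean` is an illustrative helper name):

```python
import math

def euclidean(x, y):
    # L2 distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# part (b): x and y have no 1-1 matches and no 0-0 matches
x, y = (0, 1, 0, 1), (1, 0, 1, 0)
print(euclidean(x, y))  # 2.0

# Jaccard: 1-1 matches over positions that are not 0-0 matches
m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
print(m11 / (len(x) - m00))  # 0.0
```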
Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
• Binary split: divides values into two subsets; need to find the optimal partitioning.
• A decision tree algorithm like CART, which produces binary splits, must consider 2^(k−1) − 1 ways of partitioning k attribute values into two subsets.
Example: CarType with values {Family, Sports, Luxury} can be split multi-way into Family / Sports / Luxury, or binary, e.g., {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}.
Measures for Selecting the Best Split
p(i | t): fraction of records belonging to class i at a given node t
• Measures for selecting the best split are based on the degree of impurity of the child nodes
• The smaller the degree of impurity, the more skewed the class distribution
• A node with class distribution (0, 1) has impurity = 0; a node with uniform distribution (0.5, 0.5) has the highest impurity
Measures of Node Impurity
• Gini Index: GINI(t) = 1 − Σ_i [p(i | t)]²
• Entropy: Entropy(t) = − Σ_i p(i | t) log2 p(i | t)
• Misclassification error: Classification error(t) = 1 − max_i [p(i | t)]
(NOTE: p(i | t) is the relative frequency of class i at node t, c is the number of classes, and 0 log2 0 = 0 in entropy calculations.)
Measure of Impurity: GINI
• Gini Index for a given node t:

GINI(t) = 1 − Σ_i [p(i | t)]²

(NOTE: p(i | t) is the relative frequency of class i at node t.)
– Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

C1 = 0, C2 = 6: Gini = 0.000
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
C1 = 1, C2 = 5: Gini = 0.278
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444

P(C1) = 3/6, P(C2) = 3/6
Gini = 1 − (3/6)² − (3/6)² = 0.5
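The Gini examples above can be reproduced with a small Python function (`gini` is an illustrative helper name, taking the per-class record counts of a node):

```python
def gini(counts):
    # GINI(t) = 1 - sum_i p(i|t)^2, where p(i|t) = count_i / total
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(gini([3, 3]))            # 0.5
```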
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t:

Entropy(t) = − Σ_i p(i | t) log2 p(i | t)

(NOTE: p(i | t) is the relative frequency of class i at node t.)
– Measures the homogeneity of a node.
• Maximum (log2 nc) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations
Examples for Computing Entropy
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = −0 log2 0 − 1 log2 1 = −0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
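The same examples in Python, with the 0 log2 0 = 0 convention handled by skipping empty classes (`entropy` is an illustrative helper name):

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_i p(i|t) * log2 p(i|t); terms with count 0 contribute 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```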
Splitting Criteria Based on Classification Error
• Classification error at a node t:

Error(t) = 1 − max_i [p(i | t)]

• Measures the misclassification error made by a node.
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
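The error examples, as a one-line Python function over per-class counts (`classification_error` is an illustrative helper name):

```python
def classification_error(counts):
    # Error(t) = 1 - max_i p(i|t)
    n = sum(counts)
    return 1 - max(c / n for c in counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 4))  # 0.1667
print(round(classification_error([2, 4]), 4))  # 0.3333
```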
Information Gain to Find the Best Split
Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity M0. Splitting on attribute A (Yes/No) produces child nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with combined impurity M12; splitting on attribute B (Yes/No) produces child nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with combined impurity M34. The difference in entropy gives the information gain, so the two candidate splits are compared as:
Gain = M0 − M12 vs. M0 − M34
Splitting Based on INFO...
• Information Gain:

GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
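The GAIN_split formula can be sketched in Python; `entropy` and `gain_split` are illustrative helper names, and the example split below is a hypothetical one chosen only to exercise the formula:

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_i p(i|t) * log2 p(i|t), with 0 log2 0 = 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# hypothetical split: parent (3, 3) into children (3, 1) and (0, 2)
print(round(gain_split([3, 3], [[3, 1], [0, 2]]), 3))  # 0.459
```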
(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer: The Gini for each Customer ID value is 0. Therefore, the overall Gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer: The Gini for Male is 1 − 2 × 0.5² = 0.5. The Gini for Female is also 0.5. Therefore, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
(d) Compute the Gini index for the Car Type attribute using a multiway split.
Answer: The Gini for Family car is 0.375, Sports car is 0, and Luxury car is 0.2188. The overall Gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer: The Gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall Gini for the Shirt Size attribute is 0.4914.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest Gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.
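The weighted (overall) Gini of a multiway split, as in part (d), can be sketched in Python. The per-value class counts below are assumptions chosen to be consistent with the per-value Ginis stated above (Family 0.375, Sports 0, Luxury 0.2188), not taken directly from the exercise's table:

```python
def gini(counts):
    # GINI(t) = 1 - sum_i p(i|t)^2
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(child_counts_list):
    # overall Gini = weighted average of the child-node Ginis
    n = sum(sum(c) for c in child_counts_list)
    return sum(sum(c) / n * gini(c) for c in child_counts_list)

# hypothetical (C0, C1) counts per Car Type value: Family, Sports, Luxury
car_type = [[1, 3], [8, 0], [1, 7]]
print(round(gini_split(car_type), 4))  # 0.1625
```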
3. Consider the training examples shown in Table 4.2 for a binary classification problem.
(a)What is the entropy of this collection of training examples with respect to the positive class?
Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.
(b) What are the information gains of a1 and a2 relative to these training examples?
Answer: For attribute a1, the corresponding counts and probabilities are: