Revision Chapter 3

Data Mining Revision

Types of Attributes

• Nominal. Examples: ID numbers, eye color, zip codes.

• Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.

• Interval. Example: calendar dates.

• Ratio. Examples: temperature in Kelvin, length, time, counts.


Q2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio.

(a) Time in terms of AM or PM. Answer: Binary, qualitative, ordinal.

(b) Brightness as measured by a light meter. Answer: Continuous, quantitative, ratio.

(c) Brightness as measured by people's judgments. Answer: Discrete, qualitative, ordinal.

(d) Angles as measured in degrees between 0° and 360°. Answer: Continuous, quantitative, ratio.


(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Answer: Discrete, qualitative, ordinal.

(f) Height above sea level. Answer: Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin).

(g) Number of patients in a hospital. Answer: Discrete, quantitative, ratio.

(h) ISBN numbers for books. (Look up the format on the Web.) Answer: Discrete, qualitative, nominal (ISBN numbers do have order information, though).

(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Answer: Discrete, qualitative, ordinal.


(j) Military rank. Answer: Discrete, qualitative, ordinal.

(k) Distance from the center of campus. Answer: Continuous, quantitative, interval/ratio.

(l) Density of a substance in grams per cubic centimeter. Answer: Continuous, quantitative, ratio.

(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.) Answer: Discrete, qualitative, nominal.


18. This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

Hamming distance = number of bits that differ = 3
Jaccard similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2/5 = 0.4
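A minimal Python sketch of the two measures; the helper names are my own, only the vectors come from the exercise:

```python
def hamming(x, y):
    # Count positions where the two bit vectors differ.
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # 1-1 matches divided by all positions that are not 0-0 matches.
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return m11 / (len(x) - m00)

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print(hamming(x, y))  # 3
print(jaccard(x, y))  # 0.4
```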


(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.

The Hamming distance is similar to the SMC; in fact, SMC = 1 − (Hamming distance / number of bits), since the SMC counts matching bits while the Hamming distance counts differing bits. The Jaccard measure is similar to the cosine measure because both ignore 0-0 matches.


Cosine Similarity


Example:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||x|| = sqrt(3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = sqrt(42) = 6.481
||y|| = sqrt(1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2) = sqrt(6) = 2.449
cos(x, y) = 5 / (6.481 × 2.449) = 0.3150

• If cosine similarity = 1, the angle between x and y is 0°.

• If cosine similarity = 0, the angle between x and y is 90°.
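The same computation as a short Python sketch (the function name cosine is mine, not from the slides):

```python
import math

def cosine(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(x, y), 4))  # 0.315
```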


Correlation

• Measures the linear relationship between attributes of objects.

• To compute correlation, we standardize the data objects, x and y, and then take their dot product.

$$\operatorname{corr}(x, y) = \frac{s_{xy}}{s_x\, s_y}$$


$$s_{xy} = \operatorname{covariance}(x, y) = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})$$

$$s_x = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(y_k - \bar{y})^2}$$
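A sketch of these definitions in Python, using the sample (n − 1) denominators shown above; the function name is my own:

```python
import math

def corr(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample covariance and standard deviations (n - 1 denominator).
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

print(round(corr([1, 1, 0, 1, 0, 1], [1, 1, 1, 0, 0, 1]), 2))  # 0.25 (exercise 19a below)
```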


19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard.

cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6

(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard.

cos(x, y) = 0, corr(x, y) = −1, Euclidean(x, y) = 2, Jaccard(x, y) = 0


(c) x = (0, −1, 0, 1), y = (1, 0, −1, 0): cosine, correlation, Euclidean.

cos(x, y) = 0, corr(x, y) = 0, Euclidean(x, y) = 2

(d) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.

cos(x, y) = 1, corr(x, y) = 0/0 (undefined), Euclidean(x, y) = 2

(e) x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1): cosine, correlation.

cos(x, y) = 0, corr(x, y) = 0
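These answers can be checked numerically; a quick NumPy sketch for parts (a) and (b). The cosine helper is my own, and np.corrcoef agrees with the correlation defined earlier because correlation is invariant to the choice of n versus n − 1 in the denominator:

```python
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

xa, ya = np.array([1, 1, 0, 1, 0, 1]), np.array([1, 1, 1, 0, 0, 1])
print(cosine(xa, ya), np.corrcoef(xa, ya)[0, 1])  # ≈ 0.75 0.25

xb, yb = np.array([0, 1, 0, 1]), np.array([1, 0, 1, 0])
print(cosine(xb, yb), np.corrcoef(xb, yb)[0, 1], np.linalg.norm(xb - yb))  # ≈ 0.0 -1.0 2.0
```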


Nominal Attributes

• Multi-way split: use as many partitions as distinct values.

• Binary split: divides values into two subsets; need to find the optimal partitioning.

• Decision tree algorithms such as CART produce only binary splits; a nominal attribute with k distinct values admits 2^(k−1) − 1 binary partitions.

[Diagram: multi-way split of CarType into {Family}, {Sports}, {Luxury}; binary splits of CarType into {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}.]


Measures for Selecting the Best Split

p(i|t): fraction of records belonging to class i at a given node t.

• Measures for selecting the best split are based on the degree of impurity of the child nodes.

• The smaller the degree of impurity, the more skewed the class distribution.

• A node with class distribution (0, 1) has impurity = 0; a node with uniform distribution (0.5, 0.5) has the highest impurity.


Measures of Node Impurity

• Gini index:

$$\mathrm{GINI}(t) = 1 - \sum_{i}\,[p(i \mid t)]^2$$

• Entropy:

$$\mathrm{Entropy}(t) = -\sum_{i}\, p(i \mid t)\,\log_2 p(i \mid t)$$

• Misclassification error:

$$\mathrm{Error}(t) = 1 - \max_i\,[p(i \mid t)]$$

(NOTE: p(i|t) is the relative frequency of class i at node t, c is the number of classes, and 0 log2 0 = 0 in entropy calculations.)


Measure of Impurity: GINI

• Gini index for a given node t:

$$\mathrm{GINI}(t) = 1 - \sum_{i}\,[p(i \mid t)]^2$$

(NOTE: p(i|t) is the relative frequency of class i at node t.)

– Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.

– Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples (class counts C1, C2 at a node):

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0.000

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444

C1 = 3, C2 = 3: P(C1) = 3/6, P(C2) = 3/6
Gini = 1 − (3/6)² − (3/6)² = 0.500
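A minimal sketch of the Gini computation over class counts, reproducing the four examples (function and variable names are illustrative):

```python
def gini(counts):
    # counts: number of records in each class at a node, e.g. [3, 3].
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))  # 0.0, 0.278, 0.444, 0.5
```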


Alternative Splitting Criteria Based on INFO

• Entropy at a given node t:

$$\mathrm{Entropy}(t) = -\sum_{i}\, p(i \mid t)\,\log_2 p(i \mid t)$$

(NOTE: p(i|t) is the relative frequency of class i at node t.)

– Measures the homogeneity of a node.

• Maximum (log nc) when records are equally distributed among all classes, implying the least information.

• Minimum (0.0) when all records belong to one class, implying the most information.

– Entropy-based computations are similar to the GINI index computations.


Examples for Computing Entropy

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = −0 log2 0 − 1 log2 1 = −0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.92
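The same examples in a short Python sketch, implementing the 0 log2 0 = 0 convention from the definition:

```python
import math

def entropy(counts):
    # Convention: 0 * log2(0) = 0, so empty classes are skipped.
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))  # 0.0, 0.65, 0.92
```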


Splitting Criteria Based on Classification Error

• Classification error at a node t:

$$\mathrm{Error}(t) = 1 - \max_i\,[p(i \mid t)]$$

• Measures the misclassification error made by a node.

• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.

• Minimum (0.0) when all records belong to one class, implying the most interesting information.


Examples for Computing Error

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
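And the corresponding sketch for the classification error (again with illustrative names):

```python
def classification_error(counts):
    # 1 minus the fraction of the majority class at the node.
    n = sum(counts)
    return 1.0 - max(counts) / n

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(classification_error(counts), 4))  # 0.0, 0.1667, 0.3333
```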


Information Gain to Find the Best Split

Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity M0.

[Diagram: candidate split A? produces nodes N1 and N2, with class counts (N10, N11) and (N20, N21), impurities M1 and M2, and weighted impurity M12; candidate split B? produces nodes N3 and N4, with class counts (N30, N31) and (N40, N41), impurities M3 and M4, and weighted impurity M34.]

The difference in entropy gives the information gain: compare Gain = M0 − M12 vs. Gain = M0 − M34.


Splitting Based on INFO...

• Information Gain:

$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$$

Parent node p is split into k partitions; n_i is the number of records in partition i.

– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).

– Used in ID3 and C4.5.

– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
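A sketch of the gain computation; the parent and child class counts below are made-up values for illustration only:

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    # children: one class-count list per partition of the parent.
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent with 7 + 5 records split into two partitions.
print(round(info_gain([7, 5], [[4, 1], [3, 4]]), 3))  # ≈ 0.104
```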


(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × (0.5)² = 0.5.

(b) Compute the Gini index for the Customer ID attribute.
Answer: The Gini for each Customer ID value is 0. Therefore, the overall Gini for Customer ID is 0.

(c) Compute the Gini index for the Gender attribute.
Answer: The Gini for Male is 1 − 2 × (0.5)² = 0.5, and the Gini for Female is also 0.5. Therefore, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.

(d) Compute the Gini index for the Car Type attribute using a multiway split.
Answer: The Gini for Family cars is 0.375, for Sports cars 0, and for Luxury cars 0.2188. The overall Gini is 0.1625.


(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer: The Gini for Small shirt size is 0.48, for Medium 0.4898, for Large 0.5, and for Extra Large 0.5. The overall Gini for the Shirt Size attribute is 0.4914.

(f) Which attribute is better: Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest Gini among the three attributes.

(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.
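A sketch of the weighted (multiway-split) Gini used in parts (c)-(e). The Car Type class counts below are not in the slides; they are inferred to match the per-value Ginis quoted above, so treat them as an assumption:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    # Weighted average of the children's Gini values, one child per attribute value.
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Assumed Car Type counts: Family (1 C0, 3 C1), Sports (8 C0, 0 C1), Luxury (1 C0, 7 C1).
print(round(gini_split([[1, 3], [8, 0], [1, 7]]), 4))  # 0.1625
```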


3. Consider the training examples shown in Table 4.2 for a binary classification problem.

(a) What is the entropy of this collection of training examples with respect to the positive class?

Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −(4/9) log2(4/9) − (5/9) log2(5/9) = 0.9911.
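The arithmetic checks out with the same entropy computation as in the earlier sketch:

```python
import math

counts = [4, 5]  # four positive, five negative examples
n = sum(counts)
print(round(sum(-(c / n) * math.log2(c / n) for c in counts), 4))  # 0.9911
```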


(b) What are the information gains of a1 and a2 relative to these training examples?

Answer: For attribute a1, the corresponding counts and probabilities are:
