Data Mining Revision
Types of Attributes
Nominal – Examples: ID numbers, eye color, zip codes
Ordinal – Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval – Example: calendar dates
Ratio – Examples: temperature in Kelvin, length, time, counts
Q2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people’s judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.)
Discrete, qualitative, nominal (ISBN numbers do have order
information, though)
(i) Ability to pass light in terms of the following values: opaque,
translucent, transparent.
Discrete, qualitative, ordinal
(j) Military rank.
Discrete, qualitative, ordinal
(k) Distance from the center of campus.
Continuous, quantitative, interval/ratio
(l) Density of a substance in grams per cubic centimeter.
Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often
give your coat to someone who, in turn, gives you a number that
you can use to claim your coat when you leave.)
Discrete, qualitative, nominal
18. This exercise compares and contrasts some similarity and distance measures.
(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
x = 0101010001
y = 0100011000
Hamming distance = number of different bits = 3
Jaccard similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2 / 5 = 0.4
(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.
The Hamming distance is similar to the SMC. In fact, SMC = 1 − Hamming distance / number of bits. The Jaccard measure is similar to the cosine measure because both ignore 0-0 matches.
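The Hamming and Jaccard computations above can be sketched in Python; `hamming` and `jaccard` are illustrative helper names, not from any particular library:

```python
def hamming(x, y):
    # L1 distance for binary data: number of positions where the vectors differ
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # 1-1 matches divided by the positions that are not 0-0 matches
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return m11 / (len(x) - m00)

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print(hamming(x, y))  # 3
print(jaccard(x, y))  # 0.4
```

Note that `jaccard` depends only on the positions where at least one vector is 1, which is why it behaves like the cosine measure on sparse binary data.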
Cosine Similarity
Example:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||x|| = sqrt(3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = sqrt(42) ≈ 6.481
||y|| = sqrt(1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2) = sqrt(6) ≈ 2.449
cos(x, y) = 5 / (6.481 × 2.449) ≈ 0.3150
• If cosine similarity = 1, the angle between x and y is 0°
• If cosine similarity = 0, the angle between x and y is 90°
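The worked cosine example can be checked with a short Python sketch (`cosine` is an illustrative helper name):

```python
import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(x, y), 4))  # 0.315
```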
Correlation
• Measures the linear relationship between attributes of objects
• To compute correlation, we standardize the data objects x and y, and then take their dot product

corr(x, y) = covariance(x, y) / (s_x s_y) = s_xy / (s_x s_y)

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)

standard deviation s_x = sqrt((1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)²)
standard deviation s_y = sqrt((1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)²)
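A minimal sketch of the standardize-then-dot-product computation, using sample statistics (divide by n−1) as in the definitions above; `correlation` is an illustrative helper name:

```python
import math

def correlation(x, y):
    # corr(x, y) = s_xy / (s_x * s_y), with sample covariance and std. deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

# vectors from exercise 19(a) below
print(correlation((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)))  # ≈ 0.25
```

The (n−1) factors cancel in the ratio, so dividing by n instead gives the same correlation value.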
19. For the following vectors, x and y, calculate the
indicated similarity or distance measures.
(a) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine,
correlation, Jaccard
cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation,
Euclidean, Jaccard
cos(x, y) = 0, corr(x, y) = −1, Euclidean(x, y) = 2,
Jaccard(x, y) = 0
(c) x = (0,−1, 0, 1), y = (1, 0,−1, 0) cosine, correlation,
Euclidean
cos(x, y) = 0, corr(x, y)=0, Euclidean(x, y) = 2
(d) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation,
Euclidean
cos(x, y) = 1, corr(x, y) = 0/0 (undefined),
Euclidean(x, y) = 2
(e) x = (2,−1, 0, 2, 0,−3), y = (−1, 1,−1, 0, 0,−1) cosine,
correlation
cos(x, y) = 0, corr(x, y) = 0
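The remaining measures in exercise 19 can be spot-checked the same way; a sketch for part (b), where x and y disagree in every position (`euclidean` is an illustrative helper name):

```python
import math

def euclidean(x, y):
    # L2 distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# part (b): x and y have no 1-1 matches and no 0-0 matches
x, y = (0, 1, 0, 1), (1, 0, 1, 0)
print(euclidean(x, y))  # 2.0

# Jaccard: 1-1 matches over positions that are not 0-0 matches
m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
print(m11 / (len(x) - m00))  # 0.0
```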
Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
• Binary split: divides values into two subsets; need to find the optimal partitioning.
• A decision tree algorithm like CART, which produces binary splits, must consider 2^(k−1) − 1 ways of partitioning k attribute values into two subsets.
Example: CarType with values {Family, Sports, Luxury} can be split multi-way into Family / Sports / Luxury, or binary, e.g., {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}.
Measures for Selecting the Best Split
p(i | t): fraction of records belonging to class i at a given node t
• Measures for selecting the best split are based on the degree of impurity of the child nodes
• The smaller the degree of impurity, the more skewed the class distribution
• A node with class distribution (0, 1) has impurity = 0; a node with uniform distribution (0.5, 0.5) has the highest impurity
Measures of Node Impurity
• Gini Index: GINI(t) = 1 − Σ_i [p(i | t)]²
• Entropy: Entropy(t) = − Σ_i p(i | t) log2 p(i | t)
• Misclassification error: Classification error(t) = 1 − max_i [p(i | t)]
(NOTE: p(i | t) is the relative frequency of class i at node t, c is the number of classes, and 0 log2 0 = 0 in entropy calculations.)
Measure of Impurity: GINI
• Gini Index for a given node t:

GINI(t) = 1 − Σ_i [p(i | t)]²

(NOTE: p(i | t) is the relative frequency of class i at node t.)
– Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

C1 = 0, C2 = 6: Gini = 0.000
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
C1 = 1, C2 = 5: Gini = 0.278
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444

P(C1) = 3/6, P(C2) = 3/6
Gini = 1 − (3/6)² − (3/6)² = 0.5
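The Gini examples above can be reproduced with a small Python function (`gini` is an illustrative helper name, taking the per-class record counts of a node):

```python
def gini(counts):
    # GINI(t) = 1 - sum_i p(i|t)^2, where p(i|t) = count_i / total
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(gini([3, 3]))            # 0.5
```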
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t:

Entropy(t) = − Σ_i p(i | t) log2 p(i | t)

(NOTE: p(i | t) is the relative frequency of class i at node t.)
– Measures the homogeneity of a node.
• Maximum (log2 nc) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations
Examples for Computing Entropy
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = −0 log2 0 − 1 log2 1 = −0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
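The same examples in Python, with the 0 log2 0 = 0 convention handled by skipping empty classes (`entropy` is an illustrative helper name):

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_i p(i|t) * log2 p(i|t); terms with count 0 contribute 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```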
Splitting Criteria Based on Classification Error
• Classification error at a node t:

Error(t) = 1 − max_i [p(i | t)]

• Measures the misclassification error made by a node.
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
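The error examples, as a one-line Python function over per-class counts (`classification_error` is an illustrative helper name):

```python
def classification_error(counts):
    # Error(t) = 1 - max_i p(i|t)
    n = sum(counts)
    return 1 - max(c / n for c in counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 4))  # 0.1667
print(round(classification_error([2, 4]), 4))  # 0.3333
```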
Information Gain to Find the Best Split
Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity M0. Splitting on attribute A (Yes/No) produces child nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with combined impurity M12; splitting on attribute B (Yes/No) produces child nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with combined impurity M34. The difference in entropy gives the information gain, so the two candidate splits are compared as:
Gain = M0 − M12 vs. M0 − M34
Splitting Based on INFO...
• Information Gain:

GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
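The GAIN_split formula can be sketched in Python; `entropy` and `gain_split` are illustrative helper names, and the example split below is a hypothetical one chosen only to exercise the formula:

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_i p(i|t) * log2 p(i|t), with 0 log2 0 = 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# hypothetical split: parent (3, 3) into children (3, 1) and (0, 2)
print(round(gain_split([3, 3], [[3, 1], [0, 2]]), 3))  # 0.459
```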
(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer: The Gini for each Customer ID value is 0. Therefore, the overall Gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer: The Gini for Male is 1 − 2 × 0.5² = 0.5. The Gini for Female is also 0.5. Therefore, the overall Gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
(d) Compute the Gini index for the Car Type attribute using a multiway split.
Answer: The Gini for Family car is 0.375, Sports car is 0, and Luxury car is 0.2188. The overall Gini is 0.1625.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
Answer: The Gini for Small shirt size is 0.48, Medium shirt size is 0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5. The overall Gini for the Shirt Size attribute is 0.4914.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: Car Type, because it has the lowest Gini among the three attributes.
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: The attribute has no predictive power, since new customers are assigned to new Customer IDs.
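The weighted (overall) Gini of a multiway split, as in part (d), can be sketched in Python. The per-value class counts below are assumptions chosen to be consistent with the per-value Ginis stated above (Family 0.375, Sports 0, Luxury 0.2188), not taken directly from the exercise's table:

```python
def gini(counts):
    # GINI(t) = 1 - sum_i p(i|t)^2
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(child_counts_list):
    # overall Gini = weighted average of the child-node Ginis
    n = sum(sum(c) for c in child_counts_list)
    return sum(sum(c) / n * gini(c) for c in child_counts_list)

# hypothetical (C0, C1) counts per Car Type value: Family, Sports, Luxury
car_type = [[1, 3], [8, 0], [1, 7]]
print(round(gini_split(car_type), 4))  # 0.1625
```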
3. Consider the training examples shown in Table 4.2 for a binary classification problem.
(a)What is the entropy of this collection of training examples with respect to the positive class?
Answer: There are four positive examples and five negative examples. Thus, P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is −4/9 log2(4/9) − 5/9 log2(5/9) = 0.9911.
(b) What are the information gains of a1 and a2 relative to these training examples?
Answer: For attribute a1, the corresponding counts and probabilities are: