Classification algorithms such as decision trees, ID3, information theory, entropy, CART and naive Bayesian classification are discussed with examples.
CLASSIFICATION IN DATA MINING

SUSHIL KULKARNI
Classification

What is classification?
Model Construction
ID3
Information Theory
Naïve Bayesian Classifier
CLASSIFICATION PROBLEM
• Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to exactly one class.

• The problem is to create the classes that classify the data, with the help of a given set of labelled data called the training set.
CLASSIFICATION EXAMPLES

• Teachers classify students’ grades as A, B, C, D, or F.

• Identify mushrooms as poisonous or edible.

• Identify individuals who are credit risks.
Why Classification? A motivating application

• Credit approval
  o A bank wants to classify its customers based on whether they are expected to pay back their approved loans
  o The history of past customers is used to train the classifier
  o The classifier provides rules, which identify potentially reliable future customers
Why Classification? A motivating application

• Credit approval
  o Classification rule:
    If age = “31…40” and income = high then credit_rating = excellent
  o Future customers:
    Suhas: age = 35, income = high → excellent credit rating
    Heena: age = 20, income = medium → fair credit rating
Classification — A Two-Step Process

 Model construction: describe a set of predetermined classes (Excellent and Fair) using the training set.
 The model is represented using classification rules.
Supervised Learning

 Supervised learning (classification)
  o Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  o New data is classified based on the training set
Classification Process (1): Model Construction

Training Data:

NAME    RANK            YEARS  TEACH
Henna   Assistant Prof  3      no
Leena   Assistant Prof  7      yes
Meena   Professor       2      yes
Dinesh  Associate Prof  7      yes
Dinu    Assistant Prof  6      no
Amar    Associate Prof  3      no

The training data are fed to a classification algorithm, which produces the classifier (model):

IF rank = ‘professor’ OR years > 6 THEN teach = ‘yes’
Classification Process (2): Use the Model in Prediction

The classifier is first checked against labelled testing data and then applied to unseen data.

Testing Data:

NAME    RANK            YEARS  TEACH
Swati   Assistant Prof  2      no
Malika  Associate Prof  7      no
Tina    Professor       5      yes
June    Assistant Prof  7      yes

Unseen Data: (Dina, Professor, 4). Teach?
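A minimal sketch (not part of the original slides; the function name is mine) of how the rule learned in step (1) could be applied to the unseen tuple:

```python
# Hypothetical sketch: apply the rule learned in step (1),
# IF rank = 'professor' OR years > 6 THEN teach = 'yes'.
def predict_teach(rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(predict_teach("Professor", 4))       # Dina -> 'yes'
print(predict_teach("Assistant Prof", 2))  # Swati -> 'no'
```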
Model Construction: Example

Sr. Gender Age BP Drug
1 M 20 Normal A
2 F 73 Normal B
3 M 37 High A
4 M 33 Low B
5 F 48 High A
6 M 29 Normal A
7 F 52 Normal B
8 M 42 Low B
9 M 61 Normal B
10 F 30 Normal A
11 F 26 Low B
12 M 54 High A
Model Construction: Example

[Decision tree (a directed tree): the root tests Blood Pressure. High → Drug A; Low → Drug B; Normal → test Age: ≤ 40 → Drug A, > 40 → Drug B.]
Model Construction: Example

The tree summarizes the following:
 o If BP = High, prescribe Drug A
 o If BP = Low, prescribe Drug B
 o If BP = Normal and age ≤ 40, prescribe Drug A; else prescribe Drug B

Two classes, ‘Drug A’ and ‘Drug B’, are created.
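A minimal sketch (mine, not from the slides) of these three rules as a function:

```python
# Hypothetical sketch of the rules read off the drug-prescription tree.
def prescribe(bp: str, age: int) -> str:
    if bp == "High":
        return "Drug A"
    if bp == "Low":
        return "Drug B"
    # BP is Normal: split on age, as in the tree
    return "Drug A" if age <= 40 else "Drug B"

print(prescribe("Normal", 30))  # Drug A (matches row 10 of the training set)
print(prescribe("Normal", 61))  # Drug B (matches row 9)
```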
Model Construction: Example

The tree is constructed from the training data and has no training error: all the rules we made are 100% correct according to the training data.

On practical field data, it is unlikely that we get rules with 100% accuracy and high support.
Model Construction: Example

Accuracy and Support:
 o Accuracy is 100% for the given rules.
 o If BP = High, prescribe Drug A (Support = 3/12)
 o If BP = Low, prescribe Drug B (Support = 3/12)
 o If BP = Normal and age ≤ 40, prescribe Drug A; else prescribe Drug B (Support = 3/12)
Error and Support

Let t = total number of data points, r = number of data points in a class or node, max = number of data points of the largest class in the node, and min = number of data points of the smallest class in the node. Then
 o Accuracy = max / r
 o Error = min / r
 o Support = max / t

Accuracy and Error are calculated per class (node), while the Support of a class is calculated with respect to the total number of data points in the given set.
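A minimal sketch (the function and variable names are my own) of these three measures for a node that contains a mix of two classes:

```python
# Hypothetical sketch: accuracy, error and support of a node from its class counts.
def node_measures(counts: dict, total_points: int) -> dict:
    r = sum(counts.values())         # data points reaching this node
    largest = max(counts.values())   # 'max' in the slide's notation
    smallest = min(counts.values())  # 'min' in the slide's notation
    return {
        "accuracy": largest / r,
        "error": smallest / r,
        "support": largest / total_points,
    }

# Node P from the next slide: 115 points of class A and 5 of class B, out of 180 in total.
print(node_measures({"A": 115, "B": 5}, total_points=180))
# -> accuracy = 115/120, error = 5/120, support = 115/180
```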
Rules with different accuracy & support

[Figure: a set of 180 data points is split on X into two nodes. Node P (X < 60) holds 115 A and 5 B: A = 115/120, E = 5/120, S = 115/180. Node Q (X > 60) holds 58 A and 2 B: A = 58/60, E = 2/60, S = 58/180.]
Criteria to grow the tree

 If the target attribute is categorical, the tree is called a classification tree.
 [e.g. drug prescribed]
 If the target attribute is continuous, the tree is called a regression tree.
 [e.g. income]
CLASSIFICATION TREES FOR CATEGORICAL ATTRIBUTES
INDUCTION DECISION TREE [ID3]

 Decision tree generation consists of two phases
  o Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  o Tree pruning
    • Identify and remove branches that reflect noise or outliers
 Use of the decision tree: classifying an unknown sample
  o Test the attribute values of the sample against the decision tree
Training Dataset

No  age    income  student  credit_rating  buys_computer
1   <=30   high    no       fair           no
2   <=30   high    no       excellent      no
3   31…40  high    no       fair           yes
4   >40    medium  no       fair           yes
5   >40    low     yes      fair           yes
6   >40    low     yes      excellent      no
7   31…40  low     yes      excellent      yes
8   <=30   medium  no       fair           no
9   <=30   low     yes      fair           yes
10  >40    medium  yes      fair           yes
11  <=30   medium  yes      excellent      yes
12  31…40  medium  no       excellent      yes
13  31…40  high    yes      fair           yes
14  >40    medium  no       excellent      no

This follows an example from Quinlan’s ID3.
Output: ID3 for “buys_computer”

[Decision tree: the root tests age. age <=30 → test student (no → ‘no’, yes → ‘yes’); age 31…40 → ‘yes’; age >40 → test credit rating (fair → ‘yes’, excellent → ‘no’).]

‘no’ and ‘yes’ are the two classes created.
ANOTHER EXAMPLE: MARKS

• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
[Decision tree: a chain of tests on x. x >= 90 → A; else x >= 80 → B; else x >= 70 → C; else x >= 60 → D; else F.]
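A minimal sketch (mine) of this chain of tests as code, mirroring the tree above:

```python
# Hypothetical sketch: the marks decision tree as a chain of threshold tests.
def grade(x: float) -> str:
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(m) for m in (95, 84, 72, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']
```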
ALGORITHM FOR ID3

 Basic algorithm (a greedy algorithm)
  o The tree is constructed in a top-down, recursive, divide-and-conquer manner
  o At the start, all the training examples are at the root
  o Attributes are categorical
  o Samples are partitioned recursively based on selected attributes
  o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
ALGORITHM FOR ID3

 Conditions for stopping the partitioning
  o All samples for a given node belong to the same class
  o There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  o There are no samples left
ID3: ADVANTAGES

 Easy to understand.
 Easy to generate rules.
ID3: DISADVANTAGES

 May suffer from overfitting.
 Does not easily handle continuous data.
 Trees can be quite large – pruning is necessary.
INFORMATION THEORY

When all the marbles in the bowl are mixed up, little information is given.

When the marbles in the bowl are distributed into different classes, more information is given.
ENTROPY

Entropy gives an idea of which attribute to split on when growing the tree; it is computed over the class distribution, ‘yes’ or ‘no’ in our example.

BUILDING THE TREE
Information Gain ID3

 Select the attribute with the highest information gain.
 Assume there are two classes, P and N.
  o Let the set S contain p elements of class P and n elements of class N.
  o The amount of information needed to decide whether an arbitrary object in S belongs to P or N is defined as

    I(p, n) = - (p / (p + n)) · log2 (p / (p + n)) - (n / (p + n)) · log2 (n / (p + n))
Information Gain in Decision Tree Induction

 Assume that using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}.
  o If Si contains pi elements of P and ni elements of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = Σ (i = 1 to v) ((pi + ni) / (p + n)) · I(pi, ni)

 The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)
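A minimal sketch (the helper names are my own) of these quantities in code, assuming two classes P and N:

```python
import math

# Hypothetical sketch of the ID3 information-gain quantities defined above.
def info(p: int, n: int) -> float:
    """I(p, n) for a set with p elements of class P and n of class N."""
    total = p + n
    # 0 * log2(0) is taken as 0, hence the 'if count' filter.
    return -sum(count / total * math.log2(count / total) for count in (p, n) if count)

def entropy_after_split(partitions, p, n):
    """E(A): weighted information of the partitions {S1, ..., Sv} induced by attribute A."""
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

def gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - entropy_after_split(partitions, p, n)

print(round(info(9, 5), 3))                            # 0.94, as on the next slide
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 2))  # 0.25 = Gain(age)
```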
Training Dataset

(The same 14-row buys_computer training set shown earlier; it follows an example from Quinlan’s ID3.)
Attribute Selection by Information Gain Computation

 o Class P: buys_computer = “yes”
 o Class N: buys_computer = “no”
 o I(p, n) = I(9, 5) = 0.940
 o Compute the entropy for age:

   age     pi  ni  I(pi, ni)
   <=30    2   3   0.971
   31…40   4   0   0
   >40     3   2   0.971

   E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.69

 Hence

   Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25

 Similarly

   Gain(income) = 0.029
   Gain(student) = 0.151
   Gain(credit_rating) = 0.048

AGE HAS THE MAXIMUM GAIN
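A small self-contained sketch (the per-attribute counts are read off the 14-row training set) of picking the split attribute with the maximum gain:

```python
import math

# Hypothetical sketch: choose the attribute with the highest information gain.
def info(p, n):
    return -sum(c / (p + n) * math.log2(c / (p + n)) for c in (p, n) if c)

def gain(partitions, p, n):
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

# (pi, ni) counts per attribute value in the buys_computer data (p = 9, n = 5 overall).
partitions_by_attribute = {
    "age":           [(2, 3), (4, 0), (3, 2)],  # <=30, 31..40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],  # high, medium, low
    "student":       [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}
gains = {a: round(gain(parts, 9, 5), 2) for a, parts in partitions_by_attribute.items()}
print(gains)                      # {'age': 0.25, 'income': 0.03, 'student': 0.15, 'credit_rating': 0.05}
print(max(gains, key=gains.get))  # 'age' is chosen as the root split
```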
Splitting the samples using age

age <= 30:
income  student  credit_rating  buys_computer
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

age 31…40 (all samples are ‘yes’, so this branch is labelled yes):
income  student  credit_rating  buys_computer
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

age > 40:
income  student  credit_rating  buys_computer
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no
Output: ID3 for “buys_computer”

[The resulting decision tree, as before: root tests age; age <=30 → test student (no → ‘no’, yes → ‘yes’); age 31…40 → ‘yes’; age >40 → test credit rating (fair → ‘yes’, excellent → ‘no’).]
CART [CLASSIFICATION AND REGRESSION TREES]

 The algorithm is similar to ID3 but uses the Gini index, an impurity measure, to select variables.

 If the target variable is nominal and has more than two categories, the option of merging target categories into two super-categories may be considered. This process is called twoing.
Gini Index (IBM Intelligent Miner)

If a data set T contains examples from n classes, the Gini index gini(T) is defined as

  gini(T) = 1 - Σ (j = 1 to n) pj²

where pj is the relative frequency of class j in T.
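A minimal sketch (mine) of the Gini index computed from class counts:

```python
# Hypothetical sketch of the Gini impurity of a node, from its class counts.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))  # full buys_computer set: ~0.459
print(gini([4, 0]))            # a pure node (one class only): 0.0
```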
Extracting Classification Rules from Trees

 Represent the knowledge in the form of IF-THEN rules.
 One rule is created for each path from the root to a leaf.
 Each attribute-value pair along a path forms a conjunction.
Extracting Classification Rules from Trees

 The leaf node holds the class prediction.
 Rules are easy for humans to understand.
 Example:

 IF age = “<=30” AND student = “no” THEN buys_computer = “no”
 IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
 IF age = “31…40” THEN buys_computer = “yes”
 IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
 IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
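A minimal sketch (the nested-dict tree encoding and helper name are my own) of producing one IF-THEN rule per root-to-leaf path:

```python
# Hypothetical sketch: walk a decision tree stored as nested dicts and emit rules.
tree = {
    "age": {
        "<=30":    {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):  # a leaf holds the class prediction
        conj = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions)
        return [f'IF {conj} THEN buys_computer = "{node}"']
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
```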
BAYESIAN CLASSIFICATION
Classification and Regression

 What is classification? What is regression?
 Issues regarding classification and regression
 Classification by decision tree induction
 Bayesian classification
 Other classification methods
 Regression
What is Bayesian Classification?

 Bayesian classifiers are statistical classifiers.
 For each new sample they provide a probability that the sample belongs to each of the classes.
What is Bayesian Classification?

 Example:
  o sample John (age = 27, income = high, student = no, credit_rating = fair)
  o P(John, buys_computer = yes) = 20%
  o P(John, buys_computer = no) = 80%
Naive Bayesian Classifier Example

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Should we play tennis?
Naive Bayesian Classifier Example

The 5 samples of class N (don’t play):

Outlook  Temperature  Humidity  Windy  Class
sunny    hot          high      false  N
sunny    hot          high      true   N
rain     cool         normal    true   N
sunny    mild         high      false  N
rain     mild         high      true   N

The 9 samples of class P (play):

Outlook   Temperature  Humidity  Windy  Class
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
overcast  cool         normal    true   P
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
Naive Bayesian Classifier Example

Given the training set, we compute the conditional probabilities:

Outlook      P    N        Humidity  P    N
sunny        2/9  3/5      high      3/9  4/5
overcast     4/9  0        normal    6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy     P    N
hot          2/9  2/5      true      3/9  3/5
mild         4/9  2/5      false     6/9  2/5
cool         3/9  1/5

We also have the prior probabilities P(P) = 9/14 and P(N) = 5/14.
Naive Bayesian Classifier

We write P(A) for the probability of an event A, and P(A|B) for the probability of A conditional on another event B. Let H be the hypothesis (the class) and E the evidence, i.e. the combination of attribute values. Then

  P(H|E) = P(E|H) · P(H) / P(E)

Example: let H be ‘yes’ and let E be the combination of attribute values for a new day: outlook = sunny, temperature = cool, humidity = high, windy = true. Call these four pieces E1, E2, E3 and E4. If they are independent given the class, then

  P(H|E) = P(E1|H) · P(E2|H) · P(E3|H) · P(E4|H) · P(H) / P(E)
Naive Bayesian Classifier

The denominator P(E) can be dropped and recovered later by a final normalizing step, since the probabilities of the different hypotheses must sum to 1. Thus

  P(H|E) ∝ P(E1|H) · P(E2|H) · P(E3|H) · P(E4|H) · P(H)
Naive Bayesian Classifier Example

To classify a new day E: outlook = sunny, temperature = cool, humidity = high, windy = false

Prob(P|E) ∝ Prob(P) · Prob(sunny|P) · Prob(cool|P) · Prob(high|P) · Prob(false|P)
          = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 = 0.01

Prob(N|E) ∝ Prob(N) · Prob(sunny|N) · Prob(cool|N) · Prob(high|N) · Prob(false|N)
          = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 = 0.013
Naive Bayesian Classifier Example

Probability of ‘Playing’ = 0.01 / (0.01 + 0.013) = 43%

Probability of ‘Not Playing’ = 0.013 / (0.01 + 0.013) = 57%

Therefore E takes the class label N (don’t play).
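A minimal sketch (my own layout; the conditional probabilities and priors are the ones tabulated above) of this computation in code. With the exact fractions the normalized result is about 44% / 56%; the slide's 43% / 57% comes from rounding the two scores to 0.01 and 0.013 before normalizing:

```python
import math

# Hypothetical sketch: naive Bayesian classification of one day.
cond = {
    "P": {"sunny": 2/9, "overcast": 4/9, "rain": 3/9, "hot": 2/9, "mild": 4/9,
          "cool": 3/9, "high": 3/9, "normal": 6/9, "true": 3/9, "false": 6/9},
    "N": {"sunny": 3/5, "overcast": 0.0, "rain": 2/5, "hot": 2/5, "mild": 2/5,
          "cool": 1/5, "high": 4/5, "normal": 1/5, "true": 3/5, "false": 2/5},
}
prior = {"P": 9/14, "N": 5/14}

def classify(evidence):
    # Unnormalized score per class: prior times the product of the conditionals.
    scores = {c: prior[c] * math.prod(cond[c][e] for e in evidence) for c in prior}
    total = sum(scores.values())
    return {c: round(s / total, 2) for c, s in scores.items()}  # final normalizing step

print(classify(["sunny", "cool", "high", "false"]))  # {'P': 0.44, 'N': 0.56} -> class N
print(classify(["rain", "hot", "high", "false"]))    # {'P': 0.37, 'N': 0.63} -> class N
```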
Naive Bayesian Classifier Example

Second example: X = <rain, hot, high, false>

P(X|P) · P(P) = P(rain|P) · P(hot|P) · P(high|P) · P(false|P) · P(P)
             = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|N) · P(N) = P(rain|N) · P(hot|N) · P(high|N) · P(false|N) · P(N)
             = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class N (don’t play).
Naive Bayesian Classifier Example

Probability of ‘Playing’ = 0.010582 / (0.010582 + 0.018286) = 37%

Probability of ‘Not Playing’ = 0.018286 / (0.010582 + 0.018286) = 63%

Therefore X takes the class label N.
REGRESSION
What is Regression?

 Regression is similar to classification
  o First, construct a model
  o Second, use the model to predict an unknown value
 The major method for regression is regression analysis
  • Linear and multiple regression
  • Non-linear regression
 Regression is different from classification
  o Classification predicts a categorical class label
  o Regression models continuous-valued functions
Predictive Modeling in Databases

 Predictive modeling: predict data values or construct generalized linear models based on the database data.
 One can only predict value ranges or category distributions.
 Determine the major factors which influence the regression
  o Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Regression Analysis and Log-Linear Models in Regression

 Linear regression: Y = α + β X
  o Two parameters, α and β, specify the line and are to be estimated by using the data at hand.
  o Using the least-squares criterion on the known values (x1, y1), (x2, y2), …, (xs, ys):

    β = Σ (i = 1 to s) (xi - x̄)(yi - ȳ) / Σ (i = 1 to s) (xi - x̄)²
    α = ȳ - β x̄
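A minimal sketch (the data and names are made up; a and b stand for α and β) of these least-squares estimates:

```python
# Hypothetical sketch: least-squares estimates for the line Y = a + b*X.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Tiny made-up sample lying near the line y = 2 + 3x.
xs = [1, 2, 3, 4, 5]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 2.09 2.97, close to the underlying 2 and 3
```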
Regression Analysis and Log-Linear Models in Regression

 Multiple regression: Y = b0 + b1 X1 + b2 X2
  o Many nonlinear functions can be transformed into the form above.
  o E.g., Y = b0 + b1 X + b2 X² + b3 X³ becomes linear with X1 = X, X2 = X², X3 = X³.
 Log-linear models:
  o The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  o Probability: p(a, b, c, d) = αab · βac · γad · δbcd, where each factor is a parameter of a lower-order table.
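A short sketch (made-up data and my own variable names; it relies on numpy's least-squares solver) of the transformation trick, fitting the cubic as a linear model in X1 = X, X2 = X², X3 = X³:

```python
import numpy as np

# Hypothetical sketch: fit Y = b0 + b1*X + b2*X^2 + b3*X^3 by ordinary
# linear least squares on the transformed attributes X1, X2, X3.
x = np.linspace(-2, 2, 21)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3           # made-up "true" curve

A = np.column_stack([np.ones_like(x), x, x**2, x**3])   # columns: 1, X1, X2, X3
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coeffs, 2))  # [ 1.    2.   -0.5   0.25]
```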
T H A N K S !